Re: RFC: Async query processing - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: RFC: Async query processing
Msg-id CAGTBQpZ7sRObEhTZ6ouXkA=ZgZDOfXoK7Vimy1MA_PMoVmDN8Q@mail.gmail.com
In response to Re: RFC: Async query processing  (Florian Weimer <fweimer@redhat.com>)
Responses Re: RFC: Async query processing
List pgsql-hackers
On Wed, Dec 18, 2013 at 1:50 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 11/04/2013 02:51 AM, Claudio Freire wrote:
>>
>> On Sun, Nov 3, 2013 at 3:58 PM, Florian Weimer <fweimer@redhat.com> wrote:
>>>
>>> I would like to add truly asynchronous query processing to libpq, enabling
>>> command pipelining.  The idea is to allow applications to auto-tune to the
>>> bandwidth-delay product and reduce the number of context switches when
>>> running against a local server.
>>
>> ...
>>>
>>> If the application is not interested in intermediate query results, it would
>>> use something like this:
>>
>> ...
>>>
>>> If there is no need to exit from the loop early (say, because errors are
>>> expected to be extremely rare), the PQgetResultNoWait call can be left out.
>>
>>
>> Making such a distinction doesn't seem wise to me. It sounds like
>> you're oversimplifying, and that's why you need "modes": to overcome
>> the evidently restrictive limits of the simplified interface. It
>> would only be a matter of (a short) time before some other limitation
>> requires some other mode.
>
>
> I need modes because I want to avoid unbounded buffering, which means that
> result data has to be consumed in the order queries are issued.
...
> In any case, I don't want to change the wire protocol, I just want to enable
> libpq clients to use more of its capabilities.

I believe you will at least need to use TCP_CORK or some advanced
socket options if you intend to decrease the number of packets without
changing the protocol.
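
As a rough illustration of the corking idea, here is a minimal Python sketch. It assumes a Linux host, where Python exposes socket.TCP_CORK; elsewhere it falls back to joining the chunks in user space. The names send_corked and loopback_roundtrip are made up for this example and have nothing to do with libpq.

```python
import socket


def send_corked(sock, chunks):
    """Send several small chunks while asking the kernel to coalesce them."""
    cork = getattr(socket, "TCP_CORK", None)  # only defined on some platforms
    if cork is None:
        # Fallback (assumption): no corking available, so join in user space.
        sock.sendall(b"".join(chunks))
        return
    sock.setsockopt(socket.IPPROTO_TCP, cork, 1)  # hold partial segments
    try:
        for chunk in chunks:
            sock.sendall(chunk)
    finally:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 0)  # uncork: flush now


def loopback_roundtrip(chunks):
    """Send chunks over a loopback TCP connection and return what arrived."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()
    try:
        send_corked(cli, chunks)
        cli.shutdown(socket.SHUT_WR)  # signal end of data to the reader
        received = b""
        while True:
            data = conn.recv(256)
            if not data:
                break
            received += data
        return received
    finally:
        for s in (conn, cli, srv):
            s.close()
```

The receiver sees the same byte stream either way; whether the two writes actually share one segment can only be confirmed with a packet capture.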

Because the protocol is interactive and synchronous, TCP will send the
first query in its own packet immediately, since there is no
unacknowledged data to wait on. Buffering only happens from the second
query onwards, and that won't benefit a two-query loop like the one in
your sample.

As for expectations, they can be part of the connection object and not
the wire protocol if you wish. The point I was making is that the
expectation should be part of the query call, since that's less
error-prone than setting a "discard results" mode. Think of it as
PQsendQueryParams with an extra "async" argument that defaults to
PQASYNC_NOT (i.e., sync). There you can tell libpq to expect no
results, to expect and discard them, or whatever. The benefit here is
simplified usage: your example code will be part of libpq and thus all
this complexity will be hidden from users. Furthermore, libpq will do
the small sanity check of verifying that the server returns no
results when none are expected.
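
To make the per-call expectation concrete, here is a toy Python model. It is not libpq and not a proposed API: Expect, Pipeline, send and on_result are hypothetical names, and results are simply lists of rows handed in by the caller in the order the queries were issued.

```python
from collections import deque
from enum import Enum


class Expect(Enum):
    SYNC = "sync"            # caller wants the result back (the default)
    NO_RESULT = "no_result"  # the server must return no rows
    DISCARD = "discard"      # rows may come back; drop them silently


class Pipeline:
    """Toy model: each queued query carries its own expectation."""

    def __init__(self):
        self._pending = deque()

    def send(self, query, expect=Expect.SYNC):
        self._pending.append((query, expect))

    def on_result(self, rows):
        """Pair a result with the oldest pending query and apply its rule."""
        query, expect = self._pending.popleft()
        if expect is Expect.NO_RESULT and rows:
            # The sanity check: rows arrived where none were expected.
            raise RuntimeError(f"unexpected rows for {query!r}")
        if expect is Expect.SYNC:
            return rows
        return None  # NO_RESULT or DISCARD: nothing is handed to the caller
```

Because the expectation travels with each query, there is no connection-wide mode that can be left in the wrong state between calls.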

>>>    PGAsyncMode oldMode = PQsetsendAsyncMode(conn, PQASYNC_RESULT);
>>>    bool more_data;
>>>    do {
>>>       more_data = ...;
>>>       if (more_data) {
>>>         int ret = PQsendQueryParams(conn,
>>>           "INSERT ... RETURNING ...", ...);
>>>         if (ret == 0) {
>>>           // handle low-level error
>>>         }
>>>       }
>>>       // Consume all pending results.
>>>       while (1) {
>>>         PGresult *res;
>>>         if (more_data) {
>>>           res = PQgetResultNoWait(conn);
>>>         } else {
>>>           res = PQgetResult(conn);
>>>         }
>>
>>
>> Somehow, that code looks backwards. I mean, really backwards. Wouldn't
>> that be !more_data?
>
> No, if more data is available to transfer to the server, the no-wait variant
> has to be used to avoid a needless synchronization with the server.

Ok, yeah. Now I get it. It's client-side more_data.

>> In any case, pipelining like that, without a clear distinction, in the
>> wire protocol, of which results pertain to which query, could be a
>> recipe for trouble when subtle bugs, either in lib usage or
>> implementation, mistakenly treat one query's result as another's.
>
>
> We already use pipelining in libpq (see pqFlush, PQsendQueryGuts and
> pqParseInput3), the server is supposed to support it, and there is a lack of
> a clear tit-for-tat response mechanism anyway because of NOTIFY/LISTEN and
> the way certain errors are reported.

pqFlush doesn't seem closely related, since the API specifically states
that you cannot queue multiple PQsendQuery calls. It looks more like
low-level buffering, i.e., for when the command itself is larger than the
OS buffer and nonblocking operation requires multiple send() calls for
one PQsendQuery. Am I wrong?
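
For reference, the kind of low-level buffering meant here looks roughly like the following in Python. This is a generic sketch of a nonblocking send loop, not a description of what pqFlush actually does; send_all_nonblocking is a made-up name.

```python
import select
import socket


def send_all_nonblocking(sock, data):
    """Push one large message through a nonblocking socket.

    Returns the number of successful send() calls it took.
    """
    view = memoryview(data)
    calls = 0
    while len(view) > 0:
        try:
            sent = sock.send(view)  # may accept only part of the data
            calls += 1
            view = view[sent:]
        except BlockingIOError:
            # Kernel buffer is full: wait until the socket is writable.
            select.select([], [sock], [])
    return calls
```

A message larger than the kernel's send buffer will typically need several send() calls, even though the application issued a single logical command.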

>>> Instead of buffering the results, we could buffer the encoded command
>>> messages in PQASYNC_RESULT mode.  This means that PQsendQueryParams would
>>> not block when it cannot send the (complete) command message, but store it
>>> in the connection object so that the subsequent PQgetResultNoWait and
>>> PQgetResult would send it.  This might work better with single-tuple result
>>> mode.  We cannot avoid buffering either multiple queries or multiple
>>> responses if we want to utilize the link bandwidth, or we'd risk deadlocks.
>>
>>
>> This is a non-solution. Such an implementation, at least as described,
>> would remove neither network latency nor context switches; it would be
>> purely an API change with no externally visible change in behavior.
>
>
> Ugh, why?

Oh, sorry. I had this elaborate answer prepared, but I just noticed
it's wrong: you do say "if it cannot send it right away".

So yes, I guess that's quite similar to the kind of buffering I was
talking about anyway.

Still, I'd suggest using TCP_CORK when expecting this kind of usage
pattern, or the first call in your example won't buffer at all. It's
essentially the TCP slow-start issue: unless you've got a great many
queries to pipeline, you won't see the benefit without careful use of
TCP_CORK.

Since TCP_CORK is quite platform-dependent, I'd recommend "corking" on
the library side rather than trusting the network stack.
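
"Corking" on the library side could be as simple as the following Python sketch: queue encoded messages in a user-space buffer and hand them to the kernel in one call. CorkedSender and its 8192-byte default limit are assumptions made for illustration, not anything libpq provides.

```python
import socket


class CorkedSender:
    """Accumulate small messages and write them to the socket in one go."""

    def __init__(self, sock, limit=8192):
        self._sock = sock
        self._limit = limit       # flush automatically past this many bytes
        self._buf = bytearray()

    @property
    def pending(self):
        """Number of bytes queued but not yet handed to the kernel."""
        return len(self._buf)

    def queue(self, message):
        self._buf += message
        if len(self._buf) >= self._limit:
            self.flush()

    def flush(self):
        if self._buf:
            self._sock.sendall(bytes(self._buf))  # one write for the batch
            self._buf.clear()
```

The caller would flush right before it starts waiting for results, so a short burst of queries leaves in a single write regardless of how the platform's TCP stack behaves.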


>> An effective solution must include multi-command packets. Without
>> knowing the wire protocol in detail, something like:
>>
>> PARSE: INSERT blah
>> BIND: args
>> EXECUTE with DISCARD
>> PARSE: INSERT blah
>> BIND: args
>> EXECUTE with DISCARD
>> PARSE: SELECT  blah
>> BIND: args
>> EXECUTE with FETCH ALL
>>
>> All in one packet, would be efficient and error-free (IMO).
>
>
> No, because this doesn't scale automatically with the bandwidth-delay
> product.  It also requires that the client buffers queries and their
> parameters even though the network has to do that anyway.

Why not? I'm talking about transport-level packets, btw, not libpq
frames/whatever.
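
To show what "several commands in one transport-level write" could look like, here is a Python sketch that encodes Parse, Bind, Execute and Sync messages of the v3 frontend/backend protocol into a single buffer. The helper names are made up; parameters are sent in text format, statements and portals are unnamed, and there is no "DISCARD" flag because the protocol's Execute message has none (the client would simply drop the results it is not interested in).

```python
import struct


def _msg(tag, payload):
    # The length field counts itself (4 bytes) plus the payload, not the tag.
    return tag + struct.pack("!I", len(payload) + 4) + payload


def _cstr(text):
    return text.encode("utf-8") + b"\x00"


def parse_msg(query, stmt=""):
    # Statement name, query text, then zero pre-specified parameter types.
    return _msg(b"P", _cstr(stmt) + _cstr(query) + struct.pack("!H", 0))


def bind_msg(params, portal="", stmt=""):
    body = _cstr(portal) + _cstr(stmt)
    body += struct.pack("!H", 0)            # no format codes: all text
    body += struct.pack("!H", len(params))
    for p in params:
        if p is None:
            body += struct.pack("!i", -1)   # -1 length marks NULL
        else:
            raw = p.encode("utf-8")
            body += struct.pack("!i", len(raw)) + raw
    body += struct.pack("!H", 0)            # result columns: all text
    return _msg(b"B", body)


def execute_msg(portal="", max_rows=0):
    # max_rows == 0 means "no limit".
    return _msg(b"E", _cstr(portal) + struct.pack("!i", max_rows))


def sync_msg():
    return _msg(b"S", b"")


def build_batch(statements):
    """Encode (query, params) pairs into one buffer ending with a Sync."""
    out = bytearray()
    for query, params in statements:
        out += parse_msg(query) + bind_msg(params) + execute_msg()
    out += sync_msg()
    return bytes(out)
```

With a single trailing Sync, an error in one statement makes the server skip the remaining messages up to that Sync, so the batch behaves as a unit; putting a Sync after each Execute would be the alternative if the statements should succeed or fail independently.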

Yes, the network stack will sometimes do that. But it doesn't have
to. It does it sometimes, which is not the same thing.

And buffering algorithms are quite platform-dependent anyway, so it's
not the best idea to make libpq highly reliant on them.

But yes. You would get the benefit for a large number of queries.

Launch a tcpdump and test it. This is a simple test on the loopback
interface, with Python.

On the server:

>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.bind(('',8000))
>>> s.listen(10)
>>> s2 = s.accept()[0]
>>> s2.recv(256)
b'hola mundo\n'

On the client:

>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.connect(('127.0.0.1',8000))
>>> s.send(b'hola') ; s.send(b' mundo\n')
4
7

Tcpdump output:

15:33:16.112991 IP localhost.49138 > localhost.irdmi: Flags [S], seq
2768629731, win 43690, options [mss 65495,sackOK,TS val 3152304 ecr
0,nop,wscale 7], length 0
E..<.S@.@.Ff...........@.............0.........
.0..........
15:33:16.113004 IP localhost.irdmi > localhost.49138: Flags [S.], seq
840184739, ack 2768629732, win 43690, options [mss 65495,sackOK,TS val
3152304 ecr 3152304,nop,wscale 7], length 0
E..<..@.@.<..........@..2.3..........0.........
.0...0......
15:33:16.113016 IP localhost.49138 > localhost.irdmi: Flags [.], ack
1, win 342, options [nop,nop,TS val 3152304 ecr 3152304], length 0
E..4.T@.@.Fm...........@....2.3....V.(.....
.0...0..
15:34:32.843626 IP localhost.49138 > localhost.irdmi: Flags [P.], seq
1:5, ack 1, win 342, options [nop,nop,TS val 3229034 ecr 3152304],
length 4
E..8.U@.@.Fh...........@....2.3....V.,.....
.1Ej.0..hola
15:34:32.843675 IP localhost.irdmi > localhost.49138: Flags [.], ack
5, win 342, options [nop,nop,TS val 3229035 ecr 3229034], length 0
E..4*.@.@............@..2.3........V.(.....
.1Ek.1Ej
15:34:32.843696 IP localhost.49138 > localhost.irdmi: Flags [P.], seq
5:12, ack 1, win 342, options [nop,nop,TS val 3229035 ecr 3229035],
length 7
E..;.V@.@.Fd...........@....2.3....V./.....
.1Ek.1Ek mundo

15:34:32.843701 IP localhost.irdmi > localhost.49138: Flags [.], ack
12, win 342, options [nop,nop,TS val 3229035 ecr 3229035], length 0
E..4*.@.@............@..2.3........V.(.....
.1Ek.1Ek

See how there are two packets and two ACKs.

Over Ethernet it's the same, except the server doesn't even get the
whole "hola mundo" on the first recv, just the first "hola", because of
network delay.

So, trusting the network stack to do the quick start won't work. For
steady streams of queries, it will work. But not for short bursts,
which I believe will be the most heavily used case (most apps create
short bursts of inserts and not continuous streams at full bandwidth).


