Re: Why is pq_begintypsend so slow? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Why is pq_begintypsend so slow?
Msg-id CA+TgmoYS+YHL85FK1iGnNCAq_i_uXt-oQHiFCJWaAVbmByVd_A@mail.gmail.com
In response to Re: Why is pq_begintypsend so slow?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <andres@anarazel.de> wrote:
> I don't know what the best non-gross solution for the overhead of the
> out/send functions is. There seem to be at least the following
> major options (and lots of variants thereof):
>
> 1) Just continue to incur significant overhead for every datum
> 2) Accept the ugliness of passing in a buffer via
>    FunctionCallInfo->context. Change nearly all in-core send functions
>    over to that.
> 3) Pass the string buffer through a new INTERNAL argument to send/output
>    function, allow both old/new style send functions. Use a wrapper
>    function to adapt the "old style" to the "new style".
> 4) Like 3, but make the new argument optional, and use an ad-hoc
>    stringbuffer if not provided. I don't like the unnecessary branches
>    this adds.
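
For concreteness, option 2 would mean send functions along these lines
(a hypothetical sketch only -- real code would presumably want a
node-tagged wrapper in fcinfo->context so that old-style callers can
be detected, plus some convention for the return value):

Datum
int4send(PG_FUNCTION_ARGS)
{
    int32       arg1 = PG_GETARG_INT32(0);
    /* hypothetical: caller passes the output buffer in via context */
    StringInfo  buf = (StringInfo) fcinfo->context;

    pq_sendint32(buf, (uint32) arg1);
    PG_RETURN_VOID();   /* hypothetical: data already in caller's buffer */
}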

I ran into this problem in another context today while poking at some
pg_basebackup stuff. There's another way of solving this problem which
I think we should consider: just get rid of the per-row stringinfo and
push the bytes directly from wherever they are into PqSendBuffer. Once
we start doing this, we can't error out partway through, because
internal_flush() might already have been called, sending a partial
message to the client; switching to an ErrorResponse at that point
would break protocol sync.
But it seems possible to avoid that. Just call all of the output
functions first, and also do any required encoding conversions
(pq_sendcountedtext -> pg_server_to_client). Then, do a bunch of
pq_putbytes() calls to shove the message out -- there's the small
matter of an assertion failure, but whatever. This effectively
collapses two copies into one. Or alternatively, build up an array of
iovecs and then have a variant of pq_putmessage(), like
pq_putmessage_iovec(), that knows what to do with them.
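
To sketch the iovec variant (hypothetical, of course -- this is
modeled on pq_putmessage() and internal_putbytes() in pqcomm.c, with
the PqCommBusy dance and old-protocol support elided):

#include <sys/uio.h>            /* struct iovec */

/*
 * Hypothetical: like pq_putmessage(), but gather the message body from
 * an array of iovecs rather than one contiguous buffer.  The length
 * word still has to be computed up front, since it precedes the body
 * on the wire.
 */
static int
pq_putmessage_iovec(char msgtype, const struct iovec *iov, int iovcnt)
{
    size_t      len = 0;
    uint32      n32;
    int         i;

    for (i = 0; i < iovcnt; i++)
        len += iov[i].iov_len;

    if (internal_putbytes(&msgtype, 1))
        return EOF;

    /* the length word counts itself, but not the message type byte */
    n32 = pg_hton32((uint32) (len + 4));
    if (internal_putbytes((char *) &n32, 4))
        return EOF;

    for (i = 0; i < iovcnt; i++)
        if (internal_putbytes((const char *) iov[i].iov_base,
                              iov[i].iov_len))
            return EOF;

    return 0;
}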

One advantage of this approach is that it approximately doubles the
size of the DataRow message we can send. We're currently limited to
<1GB because of palloc, but the wire protocol just needs the message to
fit in a signed 32-bit length word, i.e. to be <2GB. It would be nice
to buy
more than a factor of two here, but that would require a wire protocol
change, and 2x is not bad.
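
To spell out the arithmetic (MaxAllocSize is as defined in memutils.h):

/* palloc() refuses requests larger than this */
#define MaxAllocSize    ((Size) 0x3fffffff) /* 1 gigabyte - 1 */

/*
 * The v3 protocol length word is a signed int32 that includes its own
 * four bytes, so a message body can be up to 2^31 - 1 - 4 bytes --
 * roughly double what one palloc'd StringInfo buffer can hold.
 */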

Another advantage of this approach is that it doesn't require passing
StringInfos all over the place. For the use case that I was looking
at, that appears awkward. I'm not saying I couldn't make it work, but
it wouldn't be my first choice. Right now, I've got data bubbling down
a chain of handlers until it eventually gets sent off to the client;
with your approach, I think I'd need to bubble buffers up and then
bubble data down, which seems quite a bit more complex.

A disadvantage of this approach is that we still end up doing three
copies: one from the datum to the per-datum StringInfo, a second into
PqSendBuffer, and a third from there to the kernel. However, we could
probably improve on this. Whenever we internal_flush(), consider
whether the chunk of data we're in the process of copying (the current
call to pq_putbytes(), or the current iovec) has enough bytes
remaining to completely refill the buffer. If so, secure_write() a
buffer's worth of bytes (or more) directly, bypassing PqSendBuffer.
That way, we avoid individual system calls (or calls to OpenSSL or
GSS) for small numbers of bytes, but we also avoid extra copying when
transmitting larger amounts of data.
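
Grafted onto internal_putbytes() as it stands today, that might look
something like this (a sketch; it assumes a blocking socket and
glosses over partial writes and EINTR on the direct path):

static int
internal_putbytes(const char *s, size_t len)
{
    size_t      amount;

    while (len > 0)
    {
        /* If buffer is full, then flush it out */
        if (PqSendPointer >= PqSendBufferSize)
            if (internal_flush())
                return EOF;

        /*
         * Hypothetical bypass: if what's left of this chunk would fill
         * the now-empty buffer anyway, write it straight to the socket
         * and skip the copy into PqSendBuffer.
         */
        if (PqSendStart == PqSendPointer &&
            len >= (size_t) PqSendBufferSize)
        {
            ssize_t     r = secure_write(MyProcPort, (void *) s, len);

            if (r <= 0)
                return EOF;     /* real code would retry as needed */
            s += r;
            len -= r;
            continue;
        }

        amount = PqSendBufferSize - PqSendPointer;
        if (amount > len)
            amount = len;
        memcpy(PqSendBuffer + PqSendPointer, s, amount);
        PqSendPointer += amount;
        s += amount;
        len -= amount;
    }
    return 0;
}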

Even with that optimization, this still seems like it could end up
being less efficient than your proposal (surprise, surprise). If we've
got a preallocated buffer which we won't be forced to resize during
message construction -- and for DataRow messages we can get there just
by keeping the buffer around, so that we only need to reallocate when
we see a larger message than we've ever seen before -- and we write
all the data directly into that buffer and then send it from there
straight to the kernel, we only ever do 2 copies, whereas what I'm
proposing sometimes does 3 copies and sometimes only 2.

While I admit that's not great, it seems likely to still be a
significant win over what we have now, and it's a *lot* less invasive
than your proposal. Not only does your approach require changing all
of the type-output and type-sending functions inside and outside core
to use this new model -- admittedly with the possibility of preserving
backward compatibility -- but it also means that we could need similarly invasive
changes in any other place that wants to use this new style of message
construction. You can't write any data that you might later want to
incorporate into a protocol message anywhere except into a StringInfo;
and not only that, but you have to be able to get the
right amount of data into the right place in the StringInfo right from
the start. I think that in some cases that will require fairly complex
orchestration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


