Thread: Proposal: PqSendBuffer removal

Proposal: PqSendBuffer removal

From
Aleksei Ivanov
Date:
Dear community,

I am really curious what was the original intention of using the PqSendBuffer and is it possible to remove it now.

Currently all messages are copied from StringInfo to this buffer and sent, which from my point of view is redundant operation.
It is possible to directly send messages from StringInfo to client. For example: allocate more bytes from the beginning and fill out it before sent to client.

Maybe there was already discussion about it or if I missing something please fill free to correct me.

Thank you in advance!

Re: Proposal: PqSendBuffer removal

From
Tom Lane
Date:
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> I am really curious what was the original intention of using the
> PqSendBuffer and is it possible to remove it now.

> Currently all messages are copied from StringInfo to this buffer and sent,
> which from my point of view is redundant operation.

That would mean doing a separate send() kernel call for every few bytes,
no?  I think the point of that buffer is to be sure we accumulate a
reasonable number of bytes to pass to the kernel for each send().

            regards, tom lane



Re: Proposal: PqSendBuffer removal

From
Aleksei Ivanov
Date:
Thank you for your reply!

Yes, you are right there will be a separate call to send the data, but is copying data each time more costly operation than just one syscall?

Besides, if we already have a ready message packet to be sent why should we wait?

Waiting for your reply, 
Best regards!



On Thu, Mar 5, 2020 at 13:10 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> I am really curious what was the original intention of using the
> PqSendBuffer and is it possible to remove it now.

> Currently all messages are copied from StringInfo to this buffer and sent,
> which from my point of view is redundant operation.

That would mean doing a separate send() kernel call for every few bytes,
no?  I think the point of that buffer is to be sure we accumulate a
reasonable number of bytes to pass to the kernel for each send().

                        regards, tom lane

Re: Proposal: PqSendBuffer removal

From
Tom Lane
Date:
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> Yes, you are right there will be a separate call to send the data, but is
> copying data each time more costly operation than just one syscall?

What do you mean "just one syscall"?  The entire point here is that it'd
take more syscalls to send the same amount of data.

It does strike me that with the v3 protocol, we do sometimes have cases
where internal_putbytes is reached with a fairly large "len".  If we've
flushed out what's in PqSendBuffer to start with, and there's more than
a bufferload remaining in the source data, we could send the source
data directly to output without copying it to the buffer first.
That could actually result in *fewer* kernel calls not more, if "len"
is large enough.  But I strongly doubt that a code path that nets
out to more kernel calls will win.

            regards, tom lane



Re: Proposal: PqSendBuffer removal

From
Aleksei Ivanov
Date:
> What do you mean "just one syscall"?  The entire point here is that it'd take more syscalls to send the same amount of data.

I mean that it messages are large enough more than 2K we will need 4 syscalls without copy it to the internal buffer, but currently we will copy 8K of messages and send it using 1 call. I think that under some threshold of packet length it is redundant to copy it to internal buffer and the data can be sent directly.

> It does strike me that with the v3 protocol, we do sometimes have cases
where internal_putbytes is reached with a fairly large "len".  If we've
flushed out what's in PqSendBuffer to start with, and there's more than
a bufferload remaining in the source data, we could send the source
data directly to output without copying it to the buffer first.
That could actually result in *fewer* kernel calls not more, if "len"
is large enough.  But I strongly doubt that a code path that nets
out to more kernel calls will win.


Yes, internal_putbytes can be updated to send data directly if the length is more than internal buffer size.

On Thu, Mar 5, 2020 at 15:04 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> Yes, you are right there will be a separate call to send the data, but is
> copying data each time more costly operation than just one syscall?

What do you mean "just one syscall"?  The entire point here is that it'd
take more syscalls to send the same amount of data.

It does strike me that with the v3 protocol, we do sometimes have cases
where internal_putbytes is reached with a fairly large "len".  If we've
flushed out what's in PqSendBuffer to start with, and there's more than
a bufferload remaining in the source data, we could send the source
data directly to output without copying it to the buffer first.
That could actually result in *fewer* kernel calls not more, if "len"
is large enough.  But I strongly doubt that a code path that nets
out to more kernel calls will win.

                        regards, tom lane

Re: Proposal: PqSendBuffer removal

From
Andres Freund
Date:
Hi,

On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
>Thank you for your reply!
>
>Yes, you are right there will be a separate call to send the data, but
>is
>copying data each time more costly operation than just one syscall?

Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive
withall the security issues.  

>Besides, if we already have a ready message packet to be sent why
>should we
>wait?

In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately,
that'djust increase overhead. 


But in some paths/workloads the copy is quite noticable. I've mused before whether we could extend StringInfo to handle
caseslike this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory
(0for plain StringInfos), and one for the length of the string being built. 

Then we could get a StringInfo pointing directly to the current insertion point in the send buffer.  To support growing
it,enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.  

I can imagine that bring useful in a number of places. And because there only would be additional overhead when
actuallygrowing the StringInfo, I don't think the cost would be measurable. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Proposal: PqSendBuffer removal

From
Craig Ringer
Date:
On Fri, 6 Mar 2020 at 07:27, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
>
> > What do you mean "just one syscall"?  The entire point here is that it'd take more syscalls to send the same amount
ofdata. 
>
> I mean that it messages are large enough more than 2K we will need 4 syscalls without copy it to the internal buffer,
butcurrently we will copy 8K of messages and send it using 1 call. I think that under some threshold of packet length
itis redundant to copy it to internal buffer and the data can be sent directly. 

I think what you're suggesting is more complex than you may expect.
PostgreSQL is single threaded and relies pretty heavily on the ability
to buffer internally. It also expects its network I/O to always
succeed. Just switching to directly doing nonblocking I/O is not very
feasible. Changing the network I/O paths may expose a lot more
opportunities for send vs receive deadlocks.

It also complicates the protocol's handling of message boundaries,
since failures and interruptions can occur at more points.

Have you measured anything that suggests that our admittedly
inefficient multiple handling of send buffers is
performance-significant compared to the vast amount of memory
allocation and copying we do all over the place elsewhere? Do you have
a concrete reason to want to remove this?

If I had to change this model I'd probably be looking at an
iovector-style approach, like we use with shm_mq. Assemble an array of
buffer descriptors pointing to short, usually statically allocated
buffers and populate one with each pqformat step. Then when the
message is assembled, use writev(2) or similar to dispatch it. Maybe
do some automatic early flushing if the buffer space overflows. But
that might need a protocol extension so we had a way to recover after
interrupted sending of a partial message...

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise



Re: Proposal: PqSendBuffer removal

From
Aleksei Ivanov
Date:
>Then we could get a StringInfo pointing directly to the current insertion point in the send buffer.  To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that

I thought about it yesterday and one issue with this approach is how would you known the length of the packet to be sent. As we can’t returned back in PqSendBuffer. Also realloc is quite expensive operation.

Previously I suggested to include offset into stringinfo and if it is large enough we will have an opportunity to send it directly and it will not required a lot of changes.


On Fri, Mar 6, 2020 at 10:45 Andres Freund <andres@anarazel.de> wrote:
Hi,

On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
>Thank you for your reply!
>
>Yes, you are right there will be a separate call to send the data, but
>is
>copying data each time more costly operation than just one syscall?

Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive with all the security issues.

>Besides, if we already have a ready message packet to be sent why
>should we
>wait?

In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately, that'd just increase overhead.


But in some paths/workloads the copy is quite noticable. I've mused before whether we could extend StringInfo to handle cases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory (0 for plain StringInfos), and one for the length of the string being built.

Then we could get a StringInfo pointing directly to the current insertion point in the send buffer.  To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.

I can imagine that bring useful in a number of places. And because there only would be additional overhead when actually growing the StringInfo, I don't think the cost would be measurable.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Proposal: PqSendBuffer removal

From
Andres Freund
Date:
Hi,

On 2020-03-06 11:09:23 -0800, Aleksei Ivanov wrote:
> *>Then we could get a StringInfo pointing directly to the current insertion
> point in the send buffer.  To support growing it, enlargeStringInfo would
> first subtract the offset to the start of the allocation, and then
> reallocate that*.
>
> I thought about it yesterday and one issue with this approach is how would
> you known the length of the packet to be sent. As we can’t returned back in
> PqSendBuffer. Also realloc is quite expensive operation.

Could you expand a bit on what you see as the problem? Because I'm not
following?

What does any of this have to do with realloc performance? I mean, the
buffer would just scale up once, so the cost of that would be very
quickly amortized?

What I'm thinking is that we'd have pg_beginmessage() (potentially a
different named version) initialize the relevant StringInfo basically as

(StringInfoData){
    .data = PqSendPointer,
    .len = 0,
    .alloc_offset = PqSendBuffer - PqSendBuffer
}

and that pq_endmessage would then advance the equivalent (see below [1]) of
what today is PqSendPointer to be PqSendPointer += StringInfo->len;

That'd mean that we'd never need to copy data in/out the send buffer
anymore, because we'd directly construct the message in the send
buffer. Pretty much all important FE/BE communication goes through
pq_beginmessage[_reuse()].

We'd have to add some defenses against building multiple messages at the
same time. But neither do I think that is common, nor does it seem hard
to defend againt: A simple counter should do the trick?

Regards,
Andres


[1] Obviously the above sketch doesn't quite work that way. We can't
just have stringinfo reallocate the buffer. Feels quite solvable though.



Re: Proposal: PqSendBuffer removal

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> What I'm thinking is that we'd have pg_beginmessage() (potentially a
> different named version) initialize the relevant StringInfo basically as

> (StringInfoData){
>     .data = PqSendPointer,
>     .len = 0,
>     .alloc_offset = PqSendBuffer - PqSendBuffer
> }

This seems way overcomplicated compared to what I suggested (ie,
just let internal_putbytes bypass the buffer in cases where the
data would get flushed next time through its loop anyway).
What you're suggesting would be a lot more invasive and restrictive
--- for example, why is it a good idea to have a hard-wired
assumption that we can't build more than one message at once?

I'm also concerned that trying to do too much optimization here will
break one of the goals of the existing code, which is to not get into
a situation where an OOM failure causes a wire protocol violation
because we've already sent part of a message but are no longer able to
send the rest of it.  To ensure that doesn't happen, we must construct
the whole message before we start to send it, and we can't let
buffering of the last message be too entwined with construction of the
next one.  Between that and the (desirable) arms-length separation
between datatype I/O functions and the actual I/O, a certain amount of
data copying seems unavoidable.

            regards, tom lane



Re: Proposal: PqSendBuffer removal

From
Andres Freund
Date:
Hi,

On 2020-03-07 13:54:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > What I'm thinking is that we'd have pg_beginmessage() (potentially a
> > different named version) initialize the relevant StringInfo basically as
> 
> > (StringInfoData){
> >     .data = PqSendPointer,
> >     .len = 0,
> >     .alloc_offset = PqSendBuffer - PqSendBuffer
> > }
> 
> This seems way overcomplicated compared to what I suggested (ie,
> just let internal_putbytes bypass the buffer in cases where the
> data would get flushed next time through its loop anyway).

Well, we quite frequently send out multiple messages in a row, without a
flush inbetween. It'd be nice if we could avoid both copying buffers for
each message, as well as allocating a new stringinfo.

We've reduced the number of wholesale stringinfo reallocations with
pq_beginmessage_reuse(), which is e.g. significant when actually
returning tuples, and that was a noticable performance improvement.

I don't believe that the copy is a performance relevant factor solely
for messages that are individually too large to fit in the send
buffer. For one, there'll often be some pending send data from a
previous "round", which'd mean we'd need to call send() more often, or
use vectorized IO (i.e. switch to writev()). But also,


> What you're suggesting would be a lot more invasive and restrictive
> --- for example, why is it a good idea to have a hard-wired
> assumption that we can't build more than one message at once?

Well, we don't seem to have many (any?) places where that's not the
case. And having to use only one layer of buffering for outgoing data
does seem advantageous to me.  It'd not be hard to fall back to a
separate buffer just for the cases where there are multiple messages
built concurrently, if we want to support that.


> I'm also concerned that trying to do too much optimization here will
> break one of the goals of the existing code, which is to not get into
> a situation where an OOM failure causes a wire protocol violation
> because we've already sent part of a message but are no longer able to
> send the rest of it.  To ensure that doesn't happen, we must construct
> the whole message before we start to send it, and we can't let
> buffering of the last message be too entwined with construction of the
> next one.  Between that and the (desirable) arms-length separation
> between datatype I/O functions and the actual I/O, a certain amount of
> data copying seems unavoidable.

Sure. But I don't see why that requires two levels of buffering for
messages? If we were to build the message in the output buffer, resizing
as needed, we can send the data once the message is complete, or not at
all.

I don't think anything on the datatype I/O level would be affected?

While I think it'd be quite desirable to avoid e.g. the separate
stringinfo allocation for send functions, I think that's quite a
separate project. One which I have no really good idea to tackle.

Greetings,

Andres Freund


[1] Since I had looked it up:

We do a separate message for each of:
1) result description
2) each result row
3) ReadyForQuery

And we separately call through PQcommMethods for things like
pq_putemptymessage() and uses of pq_putmessage() not going through
pq_endmessage. The former is called a lot, especially when using the
extended query protocol (which we want clients to use!).


For a SELECT 1 in the simple protocol we end up calling putmessage via:
1) SendRowDescriptionMessage
2) printtup()
3) EndCommand()
4) ReadyForQuery()

For extended:
1) exec_parse_message()
2) exec_bind_message()
3) exec_describe_portal_message()
4) printtup()
5) EndCommand()
6) ReadyForQuery()



Re: Proposal: PqSendBuffer removal

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-03-07 13:54:57 -0500, Tom Lane wrote:
>> This seems way overcomplicated compared to what I suggested (ie,
>> just let internal_putbytes bypass the buffer in cases where the
>> data would get flushed next time through its loop anyway).

> Well, we quite frequently send out multiple messages in a row, without a
> flush inbetween. It'd be nice if we could avoid both copying buffers for
> each message, as well as allocating a new stringinfo.

That gets you right into the situation where trouble adding more messages
could corrupt/destroy messages that were supposedly already sent (but in
reality aren't flushed to the client quite yet).  I really think that
there is not enough win available here to justify introducing that kind
of fragility.

To be blunt, no actual evidence has been offered in this thread that
it's worth changing anything at all in this area.  All of the bytes
in question eventually have to be delivered to the client, which is
going to involve two kernel-space/user-space copy steps along with
who-knows-what network transmission overhead.  The idea that an
extra process-local memcpy or two is significant compared to that
seems like mostly wishful thinking to me.

            regards, tom lane



Re: Proposal: PqSendBuffer removal

From
Craig Ringer
Date:
faifaiOn Sat, 7 Mar 2020 at 02:45, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
> >Thank you for your reply!
> >
> >Yes, you are right there will be a separate call to send the data, but
> >is
> >copying data each time more costly operation than just one syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more
expensivewith all the security issues. 

Good to know.

> >Besides, if we already have a ready message packet to be sent why
> >should we
> >wait?
>
> In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately,
that'djust increase overhead. 

Right. We presently set `TCP_NODELAY` to disable Nagle's algorithm, so
we'd tend to generate these messages as separate packets as well as
seperate syscalls, making it doubly undesirable.

We can dynamically twiddle TCP_NODELAY like many other applications do
to optimise the wire packets, but not the syscalls. It'd actually cost
extra syscalls.

> But in some paths/workloads the copy is quite noticable. I've mused before whether we could extend StringInfo to
handlecases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated
memory(0 for plain StringInfos), and one for the length of the string being built. 
>
> Then we could get a StringInfo pointing directly to the current insertion point in the send buffer.  To support
growingit, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that. 
>
> I can imagine that bring useful in a number of places. And because there only would be additional overhead when
actuallygrowing the StringInfo, I don't think the cost would be measurable. 

That sounds pretty sensible as it'd be minimally intrusive.

I've wondered whether we can further optimise some uses by having
libpq-be manage an iovec for us instead, much like we support iovec
for shm_mq. Instead of a StringInfo we'd use an iovec wrapped by a
libpq-managed struct. libpq would reserve the first few entries in the
managed iovec for the message header. Variants of pq_sendint64(...)
etc would add entries to the iovec and could be inlined since they'd
just be convenience routines. The message would be flushed by having
libpq call writev() on the iovec container.

We'd want a wrapper struct for the iovec so we could have libpq keep a
cursor for the next entry in the iovec. For libpq-fe it'd also contain
a map of which iovec entries need to be free()d; for libpq-be we'd
probably palloc(), maybe with a child memory context. To support a
stack-allocated iovec for when we know all the message fields in
advance we'd have an init function that takes the address  of the
preallocated iovec and its length limit.

We could support  We could also support a libpq-wrapped iovec where
libpq can realloc it if it fills.
stack-allocated iovec - so the caller would be responsible for
managing the max size


That way we can do zero-copy scatter-gather I/O for messages that
don't require binary-to-text-format transformations etc.

BTW, if we change StringInfo, I'd like to also official bless the
usage pattern where we wrap a buffer in a StringInfo so we can use
pq_getmsgint64 etc on it. Add a initConstStringInfo(StringInfo si,
const char * buf, size_t buflen) or something that assigns the
StringInfo values and sets maxlen = -1.  The only in-core usage I see
for this so far is in src/backend/replication/logical/worker.c but
it's used extremely heavily in pglogical etc. It'd just be a
convenience function that blesses and documents existing usage.

But like Tom I really want to first come back to the evidence. Why
should we bother? Are we solving an actual problem here? PostgreSQL is
immensely memory-allocation-happy and copy-happy. Shouldn't we be more
interested in things like reducing the cost of multiple copies and
transform passes of Datum values? Especially since that's an actual
operational pain point when you're working with multi-hundred-megabyte
bytea or text fields.

Can you come up with some profiling/performance numbers that track
time spent on memory copying in the areas you propose to target, plus
malloc overheads? With a tool like systemtap or perf it should not be
overly difficult to do so by making targeted probes that filter based
on callstack, or on file / line-range or function.