Thread: Proposal: PqSendBuffer removal
Dear community,
I am really curious what the original intention of using PqSendBuffer was, and whether it is possible to remove it now.
Currently all messages are copied from a StringInfo into this buffer before being sent, which from my point of view is a redundant operation.
It should be possible to send messages directly from the StringInfo to the client. For example: allocate a few extra bytes at the beginning and fill them in before sending to the client (see the sketch after this message).
Maybe there has already been a discussion about this; if I am missing something, please feel free to correct me.
Thank you in advance!
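A rough sketch of this idea, assuming a reserved 5-byte header (type byte plus int32 length) at the front of the StringInfo; sock stands in for the client socket, and the flow here is invented for illustration, not existing backend code:

    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfoSpaces(&buf, 5);    /* placeholder: type byte + length */

    /* ... append the message payload here ... */

    buf.data[0] = 'D';                  /* message type, e.g. DataRow */
    uint32 n32 = pg_hton32((uint32) (buf.len - 1));  /* length counts itself */
    memcpy(buf.data + 1, &n32, sizeof(n32));

    send(sock, buf.data, buf.len, 0);   /* one syscall, no copy into PqSendBuffer */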
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> I am really curious what the original intention of using PqSendBuffer
> was, and whether it is possible to remove it now.
> Currently all messages are copied from a StringInfo into this buffer
> before being sent, which from my point of view is a redundant operation.

That would mean doing a separate send() kernel call for every few bytes,
no?  I think the point of that buffer is to be sure we accumulate a
reasonable number of bytes to pass to the kernel for each send().

regards, tom lane
Thank you for your reply!
Yes, you are right that there would be a separate call to send the data, but is copying the data each time a more costly operation than just one syscall?
Besides, if we already have a complete message packet ready to be sent, why should we wait?
Waiting for your reply,
Best regards!
On Thu, Mar 5, 2020 at 13:10 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> I am really curious what the original intention of using PqSendBuffer
> was, and whether it is possible to remove it now.
> Currently all messages are copied from a StringInfo into this buffer
> before being sent, which from my point of view is a redundant operation.
That would mean doing a separate send() kernel call for every few bytes,
no? I think the point of that buffer is to be sure we accumulate a
reasonable number of bytes to pass to the kernel for each send().
regards, tom lane
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> Yes, you are right that there would be a separate call to send the data,
> but is copying the data each time a more costly operation than just one
> syscall?

What do you mean "just one syscall"?  The entire point here is that it'd
take more syscalls to send the same amount of data.

It does strike me that with the v3 protocol, we do sometimes have cases
where internal_putbytes is reached with a fairly large "len".  If we've
flushed out what's in PqSendBuffer to start with, and there's more than
a bufferload remaining in the source data, we could send the source
data directly to output without copying it to the buffer first.
That could actually result in *fewer* kernel calls not more, if "len"
is large enough.  But I strongly doubt that a code path that nets
out to more kernel calls will win.

regards, tom lane
> What do you mean "just one syscall"?  The entire point here is that it'd
> take more syscalls to send the same amount of data.

I mean that if the messages are large enough, say more than 2 KB each, then sending four of them directly without copying into the internal buffer would take 4 syscalls, whereas currently we copy the 8 KB of messages into the buffer and send them with 1 call. I think that above some threshold of packet length it is redundant to copy into the internal buffer, and the data can be sent directly.
> It does strike me that with the v3 protocol, we do sometimes have cases
> where internal_putbytes is reached with a fairly large "len".  If we've
> flushed out what's in PqSendBuffer to start with, and there's more than
> a bufferload remaining in the source data, we could send the source
> data directly to output without copying it to the buffer first.
> That could actually result in *fewer* kernel calls not more, if "len"
> is large enough.  But I strongly doubt that a code path that nets
> out to more kernel calls will win.
Yes, internal_putbytes can be updated to send the data directly if the length is larger than the internal buffer size; a sketch follows.
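Something like the following, loosely patterned on internal_putbytes in pqcomm.c (the real function's details differ); internal_flush_buffer() is a hypothetical helper that writes straight from the caller's memory:

    static int
    internal_putbytes(const char *s, size_t len)
    {
        size_t  amount;

        while (len > 0)
        {
            /* If buffer is full, then flush it out */
            if (PqSendPointer >= PqSendBufferSize)
            {
                if (internal_flush())
                    return EOF;
            }

            /*
             * If the buffer is empty and the remaining data is at least a
             * full bufferload, send it straight from the source: it would
             * be flushed on the next loop iteration anyway, so copying it
             * into PqSendBuffer first buys nothing.
             */
            if (PqSendStart == PqSendPointer &&
                len >= (size_t) PqSendBufferSize)
            {
                if (internal_flush_buffer(s, len))  /* hypothetical helper */
                    return EOF;
                return 0;
            }

            amount = PqSendBufferSize - PqSendPointer;
            if (amount > len)
                amount = len;
            memcpy(PqSendBuffer + PqSendPointer, s, amount);
            PqSendPointer += amount;
            s += amount;
            len -= amount;
        }
        return 0;
    }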
On Thu, Mar 5, 2020 at 15:04 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksei Ivanov <iv.alekseii@gmail.com> writes:
> Yes, you are right that there would be a separate call to send the data,
> but is copying the data each time a more costly operation than just one
> syscall?
What do you mean "just one syscall"? The entire point here is that it'd
take more syscalls to send the same amount of data.
It does strike me that with the v3 protocol, we do sometimes have cases
where internal_putbytes is reached with a fairly large "len". If we've
flushed out what's in PqSendBuffer to start with, and there's more than
a bufferload remaining in the source data, we could send the source
data directly to output without copying it to the buffer first.
That could actually result in *fewer* kernel calls not more, if "len"
is large enough. But I strongly doubt that a code path that nets
out to more kernel calls will win.
regards, tom lane
Hi,

On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
>Thank you for your reply!
>
>Yes, you are right that there would be a separate call to send the data,
>but is copying the data each time a more costly operation than just one
>syscall?

Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive with all the security issues.

>Besides, if we already have a complete message packet ready to be sent,
>why should we wait?

In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately, that'd just increase overhead.

But in some paths/workloads the copy is quite noticeable. I've mused before whether we could extend StringInfo to handle cases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory (0 for plain StringInfos), and one for the length of the string being built.

Then we could get a StringInfo pointing directly to the current insertion point in the send buffer. To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.

I can imagine that being useful in a number of places. And because there would only be additional overhead when actually growing the StringInfo, I don't think the cost would be measurable.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
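A sketch of what that two-length StringInfo might look like; the alloc_offset field and the enlargeStringInfo adjustment are assumptions drawn from the description above, not existing code:

    typedef struct StringInfoData
    {
        char   *data;           /* current insertion base */
        int     len;            /* length of the string being built */
        int     maxlen;         /* allocated space available from data */
        int     cursor;
        int     alloc_offset;   /* data minus start of the allocation;
                                 * 0 for plain StringInfos */
    } StringInfoData;

    /* rough shape of the enlargeStringInfo adjustment */
    static void
    enlarge_offset_stringinfo(StringInfoData *str, int newsize)
    {
        char   *alloc_start = str->data - str->alloc_offset;

        alloc_start = repalloc(alloc_start, newsize + str->alloc_offset);
        str->data = alloc_start + str->alloc_offset;
        str->maxlen = newsize;
    }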
On Fri, 6 Mar 2020 at 07:27, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
>
> > What do you mean "just one syscall"?  The entire point here is that it'd
> > take more syscalls to send the same amount of data.
>
> I mean that if the messages are large enough, say more than 2 KB each,
> then sending four of them directly without copying into the internal
> buffer would take 4 syscalls, whereas currently we copy the 8 KB of
> messages into the buffer and send them with 1 call. I think that above
> some threshold of packet length it is redundant to copy into the internal
> buffer, and the data can be sent directly.

I think what you're suggesting is more complex than you may expect.

PostgreSQL is single threaded and relies pretty heavily on the ability to buffer internally. It also expects its network I/O to always succeed. Just switching to directly doing nonblocking I/O is not very feasible. Changing the network I/O paths may expose a lot more opportunities for send vs receive deadlocks. It also complicates the protocol's handling of message boundaries, since failures and interruptions can occur at more points.

Have you measured anything that suggests that our admittedly inefficient multiple handling of send buffers is performance-significant compared to the vast amount of memory allocation and copying we do all over the place elsewhere? Do you have a concrete reason to want to remove this?

If I had to change this model I'd probably be looking at an iovector-style approach, like we use with shm_mq. Assemble an array of buffer descriptors pointing to short, usually statically allocated buffers and populate one with each pqformat step. Then when the message is assembled, use writev(2) or similar to dispatch it. Maybe do some automatic early flushing if the buffer space overflows. But that might need a protocol extension so we had a way to recover after interrupted sending of a partial message...

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
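A minimal sketch of the iovec-style assembly described above; all of the names are invented, and error handling plus the early-flush path are omitted:

    #include <assert.h>
    #include <sys/uio.h>

    #define PQ_MAX_IOV 16

    typedef struct PqIovMessage
    {
        struct iovec iov[PQ_MAX_IOV];   /* descriptors, not copies */
        int          niov;              /* cursor for the next entry */
    } PqIovMessage;

    static void
    pq_iov_add(PqIovMessage *msg, const void *base, size_t len)
    {
        assert(msg->niov < PQ_MAX_IOV);
        msg->iov[msg->niov].iov_base = (void *) base;
        msg->iov[msg->niov].iov_len = len;
        msg->niov++;
    }

    /*
     * Each pqformat-style step appends a descriptor pointing at a short,
     * usually statically allocated buffer; once the message is assembled,
     * a single writev() dispatches it without gathering the bytes first:
     *
     *     writev(sock, msg.iov, msg.niov);
     */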
>Then we could get a StringInfo pointing directly to the current insertion point in the send buffer. To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.
I thought about it yesterday, and one issue with this approach is how you would know the length of the packet to be sent, since we can't go back in PqSendBuffer. Also, realloc is quite an expensive operation.
Previously I suggested including an offset in StringInfo; if the message is large enough we would have an opportunity to send it directly, and it would not require a lot of changes.
On Fri, Mar 6, 2020 at 10:45 Andres Freund <andres@anarazel.de> wrote:
Hi,
On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
>Thank you for your reply!
>
>Yes, you are right that there would be a separate call to send the data,
>but is copying the data each time a more costly operation than just one
>syscall?
Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive with all the security issues.
>Besides, if we already have a complete message packet ready to be sent,
>why should we wait?
In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately, that'd just increase overhead.
But in some paths/workloads the copy is quite noticeable. I've mused before whether we could extend StringInfo to handle cases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory (0 for plain StringInfos), and one for the length of the string being built.
Then we could get a StringInfo pointing directly to the current insertion point in the send buffer. To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.
I can imagine that being useful in a number of places. And because there would only be additional overhead when actually growing the StringInfo, I don't think the cost would be measurable.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi,

On 2020-03-06 11:09:23 -0800, Aleksei Ivanov wrote:
> >Then we could get a StringInfo pointing directly to the current insertion
> >point in the send buffer. To support growing it, enlargeStringInfo would
> >first subtract the offset to the start of the allocation, and then
> >reallocate that.
>
> I thought about it yesterday, and one issue with this approach is how you
> would know the length of the packet to be sent, since we can't go back in
> PqSendBuffer. Also, realloc is quite an expensive operation.

Could you expand a bit on what you see as the problem? Because I'm not following.

What does any of this have to do with realloc performance? I mean, the buffer would just scale up once, so the cost of that would be very quickly amortized?

What I'm thinking is that we'd have pq_beginmessage() (potentially a different named version) initialize the relevant StringInfo basically as

    (StringInfoData){
        .data = PqSendPointer,
        .len = 0,
        .alloc_offset = PqSendPointer - PqSendBuffer
    }

and that pq_endmessage would then advance the equivalent (see below [1]) of what today is PqSendPointer to be

    PqSendPointer += StringInfo->len;

That'd mean that we'd never need to copy data in/out of the send buffer anymore, because we'd directly construct the message in the send buffer. Pretty much all important FE/BE communication goes through pq_beginmessage[_reuse()].

We'd have to add some defenses against building multiple messages at the same time. But neither do I think that is common, nor does it seem hard to defend against: a simple counter should do the trick?

Regards,

Andres

[1] Obviously the above sketch doesn't quite work that way. We can't just have stringinfo reallocate the buffer. Feels quite solvable though.
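Filling in that sketch a little; the function names, the alloc_offset field, and the counter defense are all assumptions following the description above:

    static int  PqMsgBuildDepth = 0;    /* defends against nested messages */

    void
    pq_beginmessage_direct(StringInfo buf, char msgtype)
    {
        Assert(PqMsgBuildDepth == 0);   /* only one message at a time */
        PqMsgBuildDepth++;

        /* point the StringInfo at the free space in the send buffer */
        buf->data = PqSendBuffer + PqSendPointer;
        buf->len = 0;
        buf->maxlen = PqSendBufferSize - PqSendPointer;
        buf->alloc_offset = PqSendPointer;      /* assumed new field */

        appendStringInfoCharMacro(buf, msgtype);
        /* ... reserve 4 bytes for the length word, filled at end ... */
    }

    void
    pq_endmessage_direct(StringInfo buf)
    {
        /* message is complete in place; just claim the space */
        PqSendPointer += buf->len;
        PqMsgBuildDepth--;
    }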
Andres Freund <andres@anarazel.de> writes:
> What I'm thinking is that we'd have pq_beginmessage() (potentially a
> different named version) initialize the relevant StringInfo basically as

> (StringInfoData){
>     .data = PqSendPointer,
>     .len = 0,
>     .alloc_offset = PqSendPointer - PqSendBuffer
> }

This seems way overcomplicated compared to what I suggested (ie,
just let internal_putbytes bypass the buffer in cases where the
data would get flushed next time through its loop anyway).

What you're suggesting would be a lot more invasive and restrictive
--- for example, why is it a good idea to have a hard-wired
assumption that we can't build more than one message at once?

I'm also concerned that trying to do too much optimization here will
break one of the goals of the existing code, which is to not get into
a situation where an OOM failure causes a wire protocol violation
because we've already sent part of a message but are no longer able to
send the rest of it.  To ensure that doesn't happen, we must construct
the whole message before we start to send it, and we can't let
buffering of the last message be too entwined with construction of the
next one.  Between that and the (desirable) arms-length separation
between datatype I/O functions and the actual I/O, a certain amount of
data copying seems unavoidable.

regards, tom lane
Hi,

On 2020-03-07 13:54:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > What I'm thinking is that we'd have pq_beginmessage() (potentially a
> > different named version) initialize the relevant StringInfo basically as
>
> > (StringInfoData){
> >     .data = PqSendPointer,
> >     .len = 0,
> >     .alloc_offset = PqSendPointer - PqSendBuffer
> > }
>
> This seems way overcomplicated compared to what I suggested (ie,
> just let internal_putbytes bypass the buffer in cases where the
> data would get flushed next time through its loop anyway).

Well, we quite frequently send out multiple messages in a row, without a flush in between. It'd be nice if we could avoid both copying buffers for each message, as well as allocating a new stringinfo.

We've reduced the number of wholesale stringinfo reallocations with pq_beginmessage_reuse(), which is e.g. significant when actually returning tuples, and that was a noticeable performance improvement.

I don't believe that the copy is a performance-relevant factor solely for messages that are individually too large to fit in the send buffer. For one, there'll often be some pending send data from a previous "round", which'd mean we'd need to call send() more often, or use vectorized IO (i.e. switch to writev()). But also,

> What you're suggesting would be a lot more invasive and restrictive
> --- for example, why is it a good idea to have a hard-wired
> assumption that we can't build more than one message at once?

Well, we don't seem to have many (any?) places where that's not the case. And having to use only one layer of buffering for outgoing data does seem advantageous to me. It'd not be hard to fall back to a separate buffer just for the cases where there are multiple messages built concurrently, if we want to support that.

> I'm also concerned that trying to do too much optimization here will
> break one of the goals of the existing code, which is to not get into
> a situation where an OOM failure causes a wire protocol violation
> because we've already sent part of a message but are no longer able to
> send the rest of it.  To ensure that doesn't happen, we must construct
> the whole message before we start to send it, and we can't let
> buffering of the last message be too entwined with construction of the
> next one.  Between that and the (desirable) arms-length separation
> between datatype I/O functions and the actual I/O, a certain amount of
> data copying seems unavoidable.

Sure. But I don't see why that requires two levels of buffering for messages? If we were to build the message in the output buffer, resizing as needed, we can send the data once the message is complete, or not at all. I don't think anything on the datatype I/O level would be affected?

While I think it'd be quite desirable to avoid e.g. the separate stringinfo allocation for send functions, I think that's quite a separate project. One which I have no really good idea to tackle.

Greetings,

Andres Freund

[1] Since I had looked it up: we do a separate message for each of:
1) result description
2) each result row
3) ReadyForQuery

And we separately call through PQcommMethods for things like pq_putemptymessage() and uses of pq_putmessage() not going through pq_endmessage. The former is called a lot, especially when using the extended query protocol (which we want clients to use!).

For a SELECT 1 in the simple protocol we end up calling putmessage via:
1) SendRowDescriptionMessage
2) printtup()
3) EndCommand()
4) ReadyForQuery()

For extended:
1) exec_parse_message()
2) exec_bind_message()
3) exec_describe_portal_message()
4) printtup()
5) EndCommand()
6) ReadyForQuery()
Andres Freund <andres@anarazel.de> writes:
> On 2020-03-07 13:54:57 -0500, Tom Lane wrote:
>> This seems way overcomplicated compared to what I suggested (ie,
>> just let internal_putbytes bypass the buffer in cases where the
>> data would get flushed next time through its loop anyway).

> Well, we quite frequently send out multiple messages in a row, without a
> flush in between. It'd be nice if we could avoid both copying buffers for
> each message, as well as allocating a new stringinfo.

That gets you right into the situation where trouble adding more
messages could corrupt/destroy messages that were supposedly already
sent (but in reality aren't flushed to the client quite yet).  I really
think that there is not enough win available here to justify introducing
that kind of fragility.

To be blunt, no actual evidence has been offered in this thread that
it's worth changing anything at all in this area.  All of the bytes in
question eventually have to be delivered to the client, which is going
to involve two kernel-space/user-space copy steps along with
who-knows-what network transmission overhead.  The idea that an extra
process-local memcpy or two is significant compared to that seems like
mostly wishful thinking to me.

regards, tom lane
On Sat, 7 Mar 2020 at 02:45, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
> >Thank you for your reply!
> >
> >Yes, you are right that there would be a separate call to send the data,
> >but is copying the data each time a more costly operation than just one
> >syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive with all the security issues.

Good to know.

> >Besides, if we already have a complete message packet ready to be sent,
> >why should we wait?
>
> In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately, that'd just increase overhead.

Right. We presently set `TCP_NODELAY` to disable Nagle's algorithm, so we'd tend to generate these messages as separate packets as well as separate syscalls, making it doubly undesirable. We can dynamically twiddle TCP_NODELAY like many other applications do to optimise the wire packets, but not the syscalls. It'd actually cost extra syscalls.

> But in some paths/workloads the copy is quite noticeable. I've mused before whether we could extend StringInfo to handle cases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory (0 for plain StringInfos), and one for the length of the string being built.
>
> Then we could get a StringInfo pointing directly to the current insertion point in the send buffer. To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.
>
> I can imagine that being useful in a number of places. And because there would only be additional overhead when actually growing the StringInfo, I don't think the cost would be measurable.

That sounds pretty sensible as it'd be minimally intrusive.

I've wondered whether we can further optimise some uses by having libpq-be manage an iovec for us instead, much like we support iovec for shm_mq. Instead of a StringInfo we'd use an iovec wrapped by a libpq-managed struct. libpq would reserve the first few entries in the managed iovec for the message header. Variants of pq_sendint64(...) etc would add entries to the iovec and could be inlined since they'd just be convenience routines. The message would be flushed by having libpq call writev() on the iovec container.

We'd want a wrapper struct for the iovec so we could have libpq keep a cursor for the next entry in the iovec. For libpq-fe it'd also contain a map of which iovec entries need to be free()d; for libpq-be we'd probably palloc(), maybe with a child memory context.

To support a stack-allocated iovec for when we know all the message fields in advance, we'd have an init function that takes the address of the preallocated iovec and its length limit; the caller would be responsible for managing the max size. We could also support a libpq-wrapped iovec where libpq can realloc it if it fills.

That way we can do zero-copy scatter-gather I/O for messages that don't require binary-to-text-format transformations etc.

BTW, if we change StringInfo, I'd like to also officially bless the usage pattern where we wrap a buffer in a StringInfo so we can use pq_getmsgint64 etc on it. Add an initConstStringInfo(StringInfo si, const char *buf, size_t buflen) or something that assigns the StringInfo values and sets maxlen = -1.

The only in-core usage I see for this so far is in src/backend/replication/logical/worker.c but it's used extremely heavily in pglogical etc. It'd just be a convenience function that blesses and documents existing usage.

But like Tom I really want to first come back to the evidence. Why should we bother? Are we solving an actual problem here?

PostgreSQL is immensely memory-allocation-happy and copy-happy. Shouldn't we be more interested in things like reducing the cost of multiple copies and transform passes of Datum values? Especially since that's an actual operational pain point when you're working with multi-hundred-megabyte bytea or text fields.

Can you come up with some profiling/performance numbers that track time spent on memory copying in the areas you propose to target, plus malloc overheads? With a tool like systemtap or perf it should not be overly difficult to do so by making targeted probes that filter based on call stack, or on file/line-range or function.
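The suggested helper could be as small as the following sketch; it only assigns the existing StringInfoData fields, with maxlen = -1 marking the buffer as not enlargeable per the convention described above:

    /*
     * Wrap an existing read-only buffer in a StringInfo so the pq_getmsg*
     * routines can parse it without copying.
     */
    static void
    initConstStringInfo(StringInfo si, const char *buf, size_t buflen)
    {
        si->data = (char *) buf;
        si->len = (int) buflen;
        si->maxlen = -1;        /* not enlargeable */
        si->cursor = 0;
    }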