Re: Why is pq_begintypsend so slow? - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Why is pq_begintypsend so slow?
Date:
Msg-id: CA+TgmoYS+YHL85FK1iGnNCAq_i_uXt-oQHiFCJWaAVbmByVd_A@mail.gmail.com
In response to: Re: Why is pq_begintypsend so slow? (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <andres@anarazel.de> wrote:
> I don't know what the best non-gross solution for the overhead of the
> out/send functions is. There seems to be at least the following
> major options (and a lots of variants thereof):
>
> 1) Just continue to incur significant overhead for every datum
> 2) Accept the uglyness of passing in a buffer via
>    FunctionCallInfo->context. Change nearly all in-core send functions
>    over to that.
> 3) Pass string buffer through an new INTERNAL argument to send/output
>    function, allow both old/new style send functions. Use a wrapper
>    function to adapt the "old style" to the "new style".
> 4) Like 3, but make the new argument optional, and use ad-hoc
>    stringbuffer if not provided. I don't like the unnecessary branches
>    this adds.

I ran into this problem in another context today while poking at some pg_basebackup stuff. There's another way of solving this problem which I think we should consider: just get rid of the per-row stringinfo and push the bytes directly from wherever they are into PqSendBuffer.

Once we start doing this, we can't error out, because internal_flush() might've been called, sending a partial message to the client. Trying to switch to sending an ErrorResponse at that point would break protocol sync. But it seems possible to avoid that: call all of the output functions first, and also do any required encoding conversions (pq_sendcountedtext -> pg_server_to_client). Then, do a bunch of pq_putbytes() calls to shove the message out -- there's the small matter of an assertion failure, but whatever. This effectively collapses two copies into one. Or, alternatively, build up an array of iovecs and then have a variant of pq_putmessage(), like pq_putmessage_iovec(), that knows what to do with them.

One advantage of this approach is that it approximately doubles the size of the DataRow message we can send.
We're currently limited to <1GB because of palloc, but the wire protocol just needs it to be <2GB so that a signed integer does not overflow. It would be nice to buy more than a factor of two here, but that would require a wire protocol change, and 2x is not bad.

Another advantage of this approach is that it doesn't require passing StringInfos all over the place. For the use case that I was looking at, that appears awkward. I'm not saying I couldn't make it work, but it wouldn't be my first choice. Right now, I've got data bubbling down a chain of handlers until it eventually gets sent off to the client; with your approach, I think I'd need to bubble buffers up and then bubble data down, which seems quite a bit more complex.

A disadvantage of this approach is that we still end up doing three copies: one from the datum to the per-datum StringInfo, a second into PqSendBuffer, and a third from there to the kernel. However, we could probably improve on this. Whenever we internal_flush(), consider whether the chunk of data we're in the process of copying (the current call to pq_putbytes(), or the current iovec) has enough bytes remaining to completely refill the buffer. If so, secure_write() a buffer's worth of bytes (or more) directly, bypassing PqSendBuffer. That way, we avoid individual system calls (or calls to OpenSSL or GSS) for small numbers of bytes, but we also avoid extra copying when transmitting larger amounts of data.

Even with that optimization, this still seems like it could end up being less efficient than your proposal (surprise, surprise).
If we've got a preallocated buffer which we won't be forced to resize during message construction -- and for DataRow messages we can get there just by keeping the buffer around, so that we only need to reallocate when we see a larger message than we've ever seen before -- and we write all the data directly into that buffer and then send it from there straight to the kernel, we only ever do 2 copies, whereas what I'm proposing sometimes does 3 copies and sometimes only 2.

While I admit that's not great, it seems likely to still be a significant win over what we have now, and it's a *lot* less invasive than your proposal. Not only does your approach require changing all of the type-output and type-sending functions inside and outside core to use this new model, admittedly with the possibility of backward compatibility, but it also means that we could need similarly invasive changes in any other place that wants to use this new style of message construction. You can't write any data anywhere that you might want to later incorporate into a protocol message unless you write it into a StringInfo; and not only that, but you have to be able to get the right amount of data into the right place in the StringInfo right from the start. I think that in some cases that will require fairly complex orchestration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company