Hi,
On 2020-05-18 12:38:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.
Well, I wasn't really planning to try to get that patchset into 13, and
it wasn't in the CF...
> I thought about this again after seeing that [1] is mostly about
> pq_begintypsend overhead
I'm not really convinced that that's the whole problem. Using the
benchmark from upthread, I get (median of three):
master: 1181.581
yours: 1171.445
mine: 598.031
That's a very significant difference, imo. It helps a bit with the
benchmark from your [1], but not that much.
> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive. A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper. But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.
I think there's plenty more we can do:
First, it's unnecessary to re-initialize a FunctionCallInfo for every
send/recv/out/in call. Instead we can reuse the same over and over.
After that, the biggest remaining overhead for Jack's test is the palloc
for the stringinfo, as far as I can see. I've complained about that
before...
I've just hacked up a modification where, for send functions,
fcinfo->context contains a stringinfo set up by printtup/CopyTo. That,
combined with using a FunctionCallInfo set up beforehand, instead of
re-initializing it in every printtup cycle, results in a pretty good
saving.
Making the binary protocol 20% faster than text, in Jack's testcase. And
my lotsaints4 test, goes further down to ~410ms (this is 2.5x faster
than where started).
Now obviously, the hack with passing a StringInfo in ->context is just
that, a hack. A somewhat gross one even. But I think it pretty clearly
shows the problem and the way out.
I don't know what the best non-gross solution for the overhead of the
out/send functions is. There seems to be at least the following
major options (and a lots of variants thereof):
1) Just continue to incur significant overhead for every datum
2) Accept the uglyness of passing in a buffer via
FunctionCallInfo->context. Change nearly all in-core send functions
over to that.
3) Pass string buffer through an new INTERNAL argument to send/output
function, allow both old/new style send functions. Use a wrapper
function to adapt the "old style" to the "new style".
4) Like 3, but make the new argument optional, and use ad-hoc
stringbuffer if not provided. I don't like the unnecessary branches
this adds.
The biggest problem after that is that we waste a lot of time memcpying
stuff around repeatedly. There is:
1) send function: datum -> per datum stringinfo
2) printtup: per datum stringinfo -> per row stringinfo
3) socket_putmessage: per row stringinfo -> PqSendBuffer
4) send(): PqSendBuffer -> kernel buffer
It's obviously hard to avoid 1) and 4) in the common case, but the
number of other copies seem pretty clearly excessive.
If we change the signature of the out/send function to always target a
string buffer, we could pretty easily avoid 2), and for out functions
we'd not have to redundantly call strlen (as the argument to
pq_sendcountedtext) anymore, which seems substantial too.
As I argued before, I think it's unnecessary to have a separate buffer
between 3-4). We should construct the outgoing message inside the send
buffer. I still don't understand what "recursion" danger there is,
nothing below printtup should ever send protocol messages, no?
Sometimes there's also 0) in the above, when detoasting a datum...
Greetings,
Andres Freund