Thread: Why is pq_begintypsend so slow?
I was using COPY recently and was wondering why BINARY format is not much (if any) faster than the default format. Once I switched from mostly exporting ints to mostly exporting double precisions (7e6 rows of 100 columns, randomly generated), it was faster, but not by as much as I intuitively thought it should be.
Running 'perf top' to profile a "COPY BINARY .. TO '/dev/null'" on an AWS m5.large machine running Ubuntu 18.04, with self-compiled PostgreSQL:
PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
I saw that the hotspot was pq_begintypsend at 20%, which was twice the percentage as the next place winner (AllocSetAlloc). If I drill down into the function, I see something like the below. I don't really speak assembly, but usually when I see an assembly instruction being especially hot and not being the innermost instruction in a loop, I blame it on CPU cache misses. But everything being touched here should already be well cached, since initStringInfo has just got done setting it up. And if not for that, then by the 2nd invocation of appendStringInfoCharMacro it certainly should be in the cache, yet that one is even slower than the 1st appendStringInfoCharMacro.
Why is this such a bottleneck?
pq_begintypsend /usr/local/pgsql/bin/postgres
0.15 | push %rbx
0.09 | mov %rdi,%rbx
| initStringInfo(buf);
3.03 | callq initStringInfo
| /* Reserve four bytes for the bytea length word */
| appendStringInfoCharMacro(buf, '\0');
| movslq 0x8(%rbx),%rax
1.05 | lea 0x1(%rax),%edx
0.72 | cmp 0xc(%rbx),%edx
| jge b0
2.92 | mov (%rbx),%rdx
| movb $0x0,(%rdx,%rax,1)
13.76 | mov 0x8(%rbx),%eax
0.81 | mov (%rbx),%rdx
0.52 | add $0x1,%eax
0.12 | mov %eax,0x8(%rbx)
2.85 | cltq
0.01 | movb $0x0,(%rdx,%rax,1)
| appendStringInfoCharMacro(buf, '\0');
10.65 | movslq 0x8(%rbx),%rax
| lea 0x1(%rax),%edx
0.90 | cmp 0xc(%rbx),%edx
| jge ca
0.54 | 42: mov (%rbx),%rdx
1.84 | movb $0x0,(%rdx,%rax,1)
13.88 | mov 0x8(%rbx),%eax
0.03 | mov (%rbx),%rdx
| add $0x1,%eax
0.33 | mov %eax,0x8(%rbx)
2.60 | cltq
0.06 | movb $0x0,(%rdx,%rax,1)
| appendStringInfoCharMacro(buf, '\0');
3.21 | movslq 0x8(%rbx),%rax
0.23 | lea 0x1(%rax),%edx
1.74 | cmp 0xc(%rbx),%edx
| jge e0
0.21 | 67: mov (%rbx),%rdx
1.18 | movb $0x0,(%rdx,%rax,1)
9.29 | mov 0x8(%rbx),%eax
0.18 | mov (%rbx),%rdx
| add $0x1,%eax
0.19 | mov %eax,0x8(%rbx)
3.14 | cltq
0.12 | movb $0x0,(%rdx,%rax,1)
| appendStringInfoCharMacro(buf, '\0');
5.29 | movslq 0x8(%rbx),%rax
0.03 | lea 0x1(%rax),%edx
1.45 | cmp 0xc(%rbx),%edx
| jge f6
0.41 | 8c: mov (%rbx),%rdx
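For reference, the macro being expanded here is roughly this (paraphrased from src/include/lib/stringinfo.h, so treat the exact text as approximate):

    #define appendStringInfoCharMacro(str,ch) \
        (((str)->len + 1 >= (str)->maxlen) ? \
         appendStringInfoChar(str, ch) : \
         (void)((str)->data[(str)->len] = (ch), (str)->data[++(str)->len] = '\0'))

That matches the pattern in the annotation: a compare against maxlen (offset 0xc), a byte store of the character, a reload and increment of len (offset 0x8) -- the instruction perf blames -- and a second byte store for the trailing NUL.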
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> I saw that the hotspot was pq_begintypsend at 20%, which was twice the
> percentage as the next place winner (AllocSetAlloc).

Weird.

> Why is this such a bottleneck?

Not sure, but it seems like a pretty dumb way to push the stringinfo's
len forward.  We're reading/updating the len word in each line, and
if your perf measurements are to be believed, it's the re-fetches of
the len values that are bottlenecked --- maybe your CPU isn't too
bright about that?  The bytes of the string value are getting written
twice too, thanks to uselessly setting up a terminating nul each time.

I'd be inclined to replace the appendStringInfoCharMacro calls with
appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
is inserted into those bytes at this point.  And maybe
appendStringInfoSpaces could stand some micro-optimization, too.
Use a memset and a single len adjustment, perhaps?

			regards, tom lane
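(For concreteness, the memset form of appendStringInfoSpaces that Tom is suggesting would be something like this sketch; it is not the code that was ultimately committed:)

    void
    appendStringInfoSpaces(StringInfo str, int count)
    {
        if (count > 0)
        {
            /* Make more room if needed */
            enlargeStringInfo(str, count);

            /* Fill with spaces, adjusting len exactly once */
            memset(str->data + str->len, ' ', count);
            str->len += count;
            str->data[str->len] = '\0';
        }
    }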
On Sat, Jan 11, 2020 at 03:19:37PM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > I saw that the hotspot was pq_begintypsend at 20%, which was twice the
> > percentage as the next place winner (AllocSetAlloc).
>
> Weird.
>
> > Why is this such a bottleneck?
>
> Not sure, but it seems like a pretty dumb way to push the stringinfo's
> len forward.  We're reading/updating the len word in each line, and
> if your perf measurements are to be believed, it's the re-fetches of
> the len values that are bottlenecked --- maybe your CPU isn't too
> bright about that?  The bytes of the string value are getting written
> twice too, thanks to uselessly setting up a terminating nul each time.
>
> I'd be inclined to replace the appendStringInfoCharMacro calls with
> appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
> is inserted into those bytes at this point.  And maybe
> appendStringInfoSpaces could stand some micro-optimization, too.
> Use a memset and a single len adjustment, perhaps?

Please find attached a patch that does both of the things you
suggested.

Best,
David.

-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
David Fetter <david@fetter.org> writes:
> On Sat, Jan 11, 2020 at 03:19:37PM -0500, Tom Lane wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> I saw that the hotspot was pq_begintypsend at 20%, which was twice the
>>> percentage as the next place winner (AllocSetAlloc).

>> I'd be inclined to replace the appendStringInfoCharMacro calls with
>> appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
>> is inserted into those bytes at this point.  And maybe
>> appendStringInfoSpaces could stand some micro-optimization, too.
>> Use a memset and a single len adjustment, perhaps?

> Please find attached a patch that does both of the things you
> suggested.

I've been fooling around with this here.  On the test case Jeff
describes --- COPY BINARY tab TO '/dev/null' where tab contains 100
float8 columns filled from random() --- I can reproduce his results.
pq_begintypsend is the top hotspot and if perf's localization is
accurate, it's the instructions that fetch str->len that hurt the
most.  Still not very clear why that is...

Converting pq_begintypsend to use appendStringInfoSpaces helps a bit;
it takes my test case from 91725 ms to 88847 ms, or about 3% savings.
Noodling around with appendStringInfoSpaces doesn't help noticeably;
I tried memset, as well as open-coding (cf patch below) but the
results were all well within the noise threshold.

I saw at this point that the remaining top spots were
enlargeStringInfo and appendBinaryStringInfo, so I experimented
with inlining them (again, see patch below).  That *did* move
the needle: down to 72691 ms, or 20% better than HEAD.

Of course, that comes at a code-size cost, but it's only about 13kB
growth:

before:
$ size src/backend/postgres
   text	   data	    bss	    dec	    hex	filename
7485285	  58088	 203328	7746701	 76348d	src/backend/postgres

after:
$ size src/backend/postgres
   text	   data	    bss	    dec	    hex	filename
7498652	  58088	 203328	7760068	 7668c4	src/backend/postgres

That's under two-tenths of a percent.  (This'd affect frontend
executables too, and I didn't check them.)

Seems like this is worth pursuing, especially if it can be shown to
improve any other cases noticeably.  It might be worth inlining some
of the other trivial stringinfo functions, though I'd tread carefully
on that.

			regards, tom lane

diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index a6f990c..0fc8c3f 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -328,11 +328,8 @@ void
 pq_begintypsend(StringInfo buf)
 {
 	initStringInfo(buf);
-	/* Reserve four bytes for the bytea length word */
-	appendStringInfoCharMacro(buf, '\0');
-	appendStringInfoCharMacro(buf, '\0');
-	appendStringInfoCharMacro(buf, '\0');
-	appendStringInfoCharMacro(buf, '\0');
+	/* Reserve four bytes for the bytea length word; contents not important */
+	appendStringInfoSpaces(buf, 4);
 }
 
 /* --------------------------------
diff --git a/src/common/stringinfo.c b/src/common/stringinfo.c
index 0badc46..8f8bb0d 100644
--- a/src/common/stringinfo.c
+++ b/src/common/stringinfo.c
@@ -207,43 +207,21 @@ appendStringInfoSpaces(StringInfo str, int count)
 {
 	if (count > 0)
 	{
+		char	   *ptr;
+
 		/* Make more room if needed */
 		enlargeStringInfo(str, count);
 
 		/* OK, append the spaces */
+		ptr = str->data + str->len;
+		str->len += count;
 		while (--count >= 0)
-			str->data[str->len++] = ' ';
-		str->data[str->len] = '\0';
+			*ptr++ = ' ';
+		*ptr = '\0';
 	}
 }
 
 /*
- * appendBinaryStringInfo
- *
- * Append arbitrary binary data to a StringInfo, allocating more space
- * if necessary. Ensures that a trailing null byte is present.
- */
-void
-appendBinaryStringInfo(StringInfo str, const char *data, int datalen)
-{
-	Assert(str != NULL);
-
-	/* Make more room if needed */
-	enlargeStringInfo(str, datalen);
-
-	/* OK, append the data */
-	memcpy(str->data + str->len, data, datalen);
-	str->len += datalen;
-
-	/*
-	 * Keep a trailing null in place, even though it's probably useless for
-	 * binary data. (Some callers are dealing with text but call this because
-	 * their input isn't null-terminated.)
-	 */
-	str->data[str->len] = '\0';
-}
-
-/*
  * appendBinaryStringInfoNT
  *
  * Append arbitrary binary data to a StringInfo, allocating more space
@@ -263,7 +241,7 @@ appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen)
 }
 
 /*
- * enlargeStringInfo
+ * enlargeStringInfo_OOL
  *
  * Make sure there is enough space for 'needed' more bytes
  * ('needed' does not include the terminating null).
@@ -274,13 +252,16 @@ appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen)
  * can save some palloc overhead by enlarging the buffer before starting
  * to store data in it.
  *
+ * We normally reach here only if enlargement is needed, since callers
+ * go through the inline function which doesn't call this otherwise.
+ *
  * NB: In the backend, because we use repalloc() to enlarge the buffer, the
  * string buffer will remain allocated in the same memory context that was
  * current when initStringInfo was called, even if another context is now
  * current.  This is the desired and indeed critical behavior!
  */
 void
-enlargeStringInfo(StringInfo str, int needed)
+enlargeStringInfo_OOL(StringInfo str, int needed)
 {
 	int			newlen;
 
diff --git a/src/include/lib/stringinfo.h b/src/include/lib/stringinfo.h
index 5a2a3db..30ba826 100644
--- a/src/include/lib/stringinfo.h
+++ b/src/include/lib/stringinfo.h
@@ -87,6 +87,25 @@ extern void initStringInfo(StringInfo str);
 extern void resetStringInfo(StringInfo str);
 
 /*------------------------
+ * enlargeStringInfo
+ * Make sure a StringInfo's buffer can hold at least 'needed' more bytes
+ * ('needed' does not include the terminating null).
+ * The inlined code eliminates the common case where no work is needed.
+ */
+extern void enlargeStringInfo_OOL(StringInfo str, int needed);
+
+static inline void
+enlargeStringInfo(StringInfo str, int needed)
+{
+	/*
+	 * Do the test in unsigned arithmetic so that if "needed" is negative,
+	 * we'll go to the out-of-line code which will error out.
+	 */
+	if ((unsigned) needed >= (unsigned) (str->maxlen - str->len))
+		enlargeStringInfo_OOL(str, needed);
+}
+
+/*------------------------
  * appendStringInfo
  * Format text data under the control of fmt (an sprintf-style format string)
  * and append it to whatever is already in str.  More space is allocated
@@ -139,10 +158,27 @@ extern void appendStringInfoSpaces(StringInfo str, int count);
 
 /*------------------------
  * appendBinaryStringInfo
  * Append arbitrary binary data to a StringInfo, allocating more space
- * if necessary.
+ * if necessary.  Ensures that a trailing null byte is present.
  */
-extern void appendBinaryStringInfo(StringInfo str,
-								   const char *data, int datalen);
+static inline void
+appendBinaryStringInfo(StringInfo str, const char *data, int datalen)
+{
+	Assert(str != NULL);
+
+	/* Make more room if needed */
+	enlargeStringInfo(str, datalen);
+
+	/* OK, append the data */
+	memcpy(str->data + str->len, data, datalen);
+	str->len += datalen;
+
+	/*
+	 * Keep a trailing null in place, even though it's probably useless for
+	 * binary data.  (Some callers are dealing with text but call this because
+	 * their input isn't null-terminated.)
+	 */
+	str->data[str->len] = '\0';
+}
 
 /*------------------------
  * appendBinaryStringInfoNT
@@ -152,10 +188,4 @@ extern void appendBinaryStringInfo(StringInfo str,
 extern void appendBinaryStringInfoNT(StringInfo str,
 									 const char *data, int datalen);
 
-/*------------------------
- * enlargeStringInfo
- * Make sure a StringInfo's buffer can hold at least 'needed' more bytes.
- */
-extern void enlargeStringInfo(StringInfo str, int needed);
-
 #endif							/* STRINGINFO_H */
I wrote:
> I saw at this point that the remaining top spots were
> enlargeStringInfo and appendBinaryStringInfo, so I experimented
> with inlining them (again, see patch below).  That *did* move
> the needle: down to 72691 ms, or 20% better than HEAD.

Oh ... marking the test in the inline part of enlargeStringInfo()
as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
Might be over-optimizing for this particular case, perhaps, but
I think that's a reasonable marking given that we overallocate
the stringinfo buffer for most uses.

(But ... I'm not finding these numbers to be super reproducible
across different ASLR layouts.  So take it with a grain of salt.)

			regards, tom lane
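(Concretely, the unlikely() marking goes on the inline test from the patch upthread; a sketch, reusing the enlargeStringInfo_OOL split shown there:)

    static inline void
    enlargeStringInfo(StringInfo str, int needed)
    {
        /*
         * The unsigned comparison routes a negative "needed" to the
         * out-of-line code, which errors out; unlikely() tells the compiler
         * to lay out code for the no-enlargement fast path.
         */
        if (unlikely((unsigned) needed >= (unsigned) (str->maxlen - str->len)))
            enlargeStringInfo_OOL(str, needed);
    }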
Hi,

On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> I wrote:
> > I saw at this point that the remaining top spots were
> > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > with inlining them (again, see patch below).  That *did* move
> > the needle: down to 72691 ms, or 20% better than HEAD.
>
> Oh ... marking the test in the inline part of enlargeStringInfo()
> as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> Might be over-optimizing for this particular case, perhaps
>
> (But ... I'm not finding these numbers to be super reproducible
> across different ASLR layouts.  So take it with a grain of salt.)

FWIW, I've also observed, in another thread (the node func generation
thing [1]), that inlining enlargeStringInfo() helps a lot, especially
when inlining some of its callers. Moving e.g. appendStringInfo()
inline allows the compiler to sometimes optimize away the strlen. But
if e.g. an inlined appendBinaryStringInfo() still calls
enlargeStringInfo() unconditionally, successive appends cannot
optimize away memory accesses for ->len/->data.

For the case of send functions, we really ought to have at least
pq_begintypsend(), enlargeStringInfo() and pq_endtypsend() inline.
That way the compiler ought to be able to avoid repeatedly
loading/storing ->len, after the initial initStringInfo() call. Might
even make sense to also have initStringInfo() inline, because then the
compiler would probably never actually materialize the StringInfoData
(and would automatically have good aliasing information too).

The commit referenced above is obviously quite WIP-ey, and contains
things that should be split into separate commits. But I think it
might be worth moving more functions into the header, like I've done
in that commit.

The commit also adds appendStringInfoU?Int(32,64) operations - I've
unsurprisingly found these to be *considerably* faster than going
through appendStringInfoString().

> but I think that's a reasonable marking given that we overallocate
> the stringinfo buffer for most uses.

Wonder if it's worth providing a function to initialize the stringinfo
differently for the many cases where we have at least a very good idea
of how long the string will be. It's sad to allocate 1kb just for
e.g. int4send to send an integer plus length.

Greetings,

Andres Freund

[1] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=127e860cf65f50434e0bb97acbba4b0ea6f38cfd
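(A sketch of what such an initializer could look like; the name and exact shape here are illustrative -- Andres's later patches call theirs initStringInfoEx():)

    static inline void
    initStringInfoWithSize(StringInfo str, int initsize)
    {
        /* caller promises initsize bytes suffice, including the trailing NUL */
        str->data = (char *) palloc(initsize);
        str->maxlen = initsize;
        resetStringInfo(str);
    }

For int4send, a caller could then allocate just VARHDRSZ + sizeof(int32) + 1 bytes rather than the default 1024.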
Hi,

On 2020-01-13 15:18:04 -0800, Andres Freund wrote:
> On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> > I wrote:
> > > I saw at this point that the remaining top spots were
> > > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > > with inlining them (again, see patch below).  That *did* move
> > > the needle: down to 72691 ms, or 20% better than HEAD.
> >
> > Oh ... marking the test in the inline part of enlargeStringInfo()
> > as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> > Might be over-optimizing for this particular case, perhaps
> >
> > (But ... I'm not finding these numbers to be super reproducible
> > across different ASLR layouts.  So take it with a grain of salt.)
>
> FWIW, I've also observed, in another thread (the node func generation
> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> allows the compiler to sometimes optimize away the strlen. But if
> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> unconditionally, successive appends cannot optimize away memory accesses
> for ->len/->data.

With a set of patches doing so, int4send itself is not a significant
factor for my test benchmark [1] anymore. The assembly looks about as
good as one could hope, I think:

   # save rbx on the stack
   0x00000000004b7f90 <+0>:	push   %rbx
   0x00000000004b7f91 <+1>:	sub    $0x20,%rsp
   # store integer to be sent into rbx
   0x00000000004b7f95 <+5>:	mov    0x20(%rdi),%rbx
   # palloc length argument
   0x00000000004b7f99 <+9>:	mov    $0x9,%edi
   0x00000000004b7f9e <+14>:	callq  0x5d9aa0 <palloc>
   # store integer in buffer (ebx is 4 byte portion of rbx)
   0x00000000004b7fa3 <+19>:	movbe  %ebx,0x4(%rax)
   # store varlena header
   0x00000000004b7fa8 <+24>:	movl   $0x20,(%rax)
   # restore stack and rbx registers
   0x00000000004b7fae <+30>:	add    $0x20,%rsp
   0x00000000004b7fb2 <+34>:	pop    %rbx
   0x00000000004b7fb3 <+35>:	retq

All the $0x20 constants are a bit confusing, but they just happen to be
the same for int4send. It's the size of the stack frame, offset for
FunctionCallInfoBaseData->args[0], the varlena header (and then the
stack frame again) respectively.

Note that I had to annotate palloc with __attribute__((malloc)) to make
the compiler understand that palloc's returned value will not alias
with anything problematic (e.g. the potential of aliasing with fcinfo
prevents optimizing to the above without that annotation). I think such
annotations would be a good idea anyway, precisely because they allow
the compiler to optimize code significantly better.

These together yield about a 1.8x speedup for me. The profile shows
that the overhead now is overwhelmingly elsewhere:

+   26.30%  postgres  postgres      [.] CopyOneRowTo
+   13.40%  postgres  postgres      [.] tts_buffer_heap_getsomeattrs
+   10.61%  postgres  postgres      [.] AllocSetAlloc
+    9.26%  postgres  libc-2.29.so  [.] __memmove_avx_unaligned_erms
+    7.32%  postgres  postgres      [.] SendFunctionCall
+    6.02%  postgres  postgres      [.] palloc
+    4.45%  postgres  postgres      [.] int4send
+    3.68%  postgres  libc-2.29.so  [.] _IO_fwrite
+    2.71%  postgres  postgres      [.] heapgettup_pagemode
+    1.96%  postgres  postgres      [.] AllocSetReset
+    1.83%  postgres  postgres      [.] CopySendEndOfRow
+    1.75%  postgres  libc-2.29.so  [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.60%  postgres  postgres      [.] ExecStoreBufferHeapTuple
+    1.57%  postgres  postgres      [.] DoCopyTo
+    1.16%  postgres  postgres      [.] memcpy@plt
+    1.07%  postgres  postgres      [.] heapgetpage

Even without using the new pq_begintypsend_ex()/initStringInfoEx(),
the generated code is still considerably better than before, yielding
a 1.58x speedup. The allocator overhead unsurprisingly is higher:

+   24.93%  postgres  postgres      [.] CopyOneRowTo
+   17.10%  postgres  postgres      [.] AllocSetAlloc
+   10.09%  postgres  postgres      [.] tts_buffer_heap_getsomeattrs
+    6.50%  postgres  libc-2.29.so  [.] __memmove_avx_unaligned_erms
+    5.99%  postgres  postgres      [.] SendFunctionCall
+    5.11%  postgres  postgres      [.] palloc
+    3.95%  postgres  libc-2.29.so  [.] _int_malloc
+    3.38%  postgres  postgres      [.] int4send
+    2.54%  postgres  postgres      [.] heapgettup_pagemode
+    2.11%  postgres  libc-2.29.so  [.] _int_free
+    2.06%  postgres  postgres      [.] MemoryContextReset
+    2.02%  postgres  postgres      [.] AllocSetReset
+    1.97%  postgres  libc-2.29.so  [.] _IO_fwrite
+    1.47%  postgres  postgres      [.] DoCopyTo
+    1.14%  postgres  postgres      [.] ExecStoreBufferHeapTuple
+    1.06%  postgres  libc-2.29.so  [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.04%  postgres  libc-2.29.so  [.] malloc

Adding a few pg_restrict*, and using appendBinaryStringInfoNT instead
of appendBinaryStringInfo in CopySend* gains another 1.05x.

This does result in some code growth, but given the size of the
improvements, and that the improvements are significant even without
code changes to callsites, that seems worth it.

before:
   text	   data	    bss	    dec	    hex	filename
8482739	 172304	 204240	8859283	 872e93	src/backend/postgres

after:
   text	   data	    bss	    dec	    hex	filename
8604300	 172304	 204240	8980844	 89096c	src/backend/postgres

Regards,

Andres

[1]
CREATE TABLE lotsaints4(c01 int4 NOT NULL, c02 int4 NOT NULL, c03 int4 NOT NULL, c04 int4 NOT NULL, c05 int4 NOT NULL,
    c06 int4 NOT NULL, c07 int4 NOT NULL, c08 int4 NOT NULL, c09 int4 NOT NULL, c10 int4 NOT NULL);
INSERT INTO lotsaints4
    SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
           ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
           ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
           ((random() * 2^31) - 1)::int
    FROM generate_series(1, 2000000);
VACUUM FREEZE lotsaints4;

COPY lotsaints4 TO '/dev/null' WITH binary;

CREATE TABLE lotsaints8(c01 int8 NOT NULL, c02 int8 NOT NULL, c03 int8 NOT NULL, c04 int8 NOT NULL, c05 int8 NOT NULL,
    c06 int8 NOT NULL, c07 int8 NOT NULL, c08 int8 NOT NULL, c09 int8 NOT NULL, c10 int8 NOT NULL);
INSERT INTO lotsaints8
    SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
           ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
           ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
           ((random() * 2^31) - 1)::int
    FROM generate_series(1, 2000000);
VACUUM FREEZE lotsaints8;

COPY lotsaints8 TO '/dev/null' WITH binary;
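(The palloc annotation Andres describes would look roughly like the following; the pg_attribute_malloc macro name is an assumption, modeled on the existing pg_attribute_* wrappers in c.h:)

    #if defined(__GNUC__) || defined(__clang__)
    #define pg_attribute_malloc __attribute__((malloc))
    #else
    #define pg_attribute_malloc
    #endif

    /* promises the returned pointer does not alias any live object */
    extern void *palloc(Size size) pg_attribute_malloc;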
Attachment
- v1-0001-stringinfo-Move-more-functions-inline-provide-ini.patch
- v1-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch
- v1-0003-pqformat-Move-functions-for-type-sending-inline-a.patch
- v1-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch
- v1-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch
- v1-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch
Andres Freund <andres@anarazel.de> writes:
>> FWIW, I've also observed, in another thread (the node func generation
>> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
>> when inlining some of its callers. Moving e.g. appendStringInfo() inline
>> allows the compiler to sometimes optimize away the strlen. But if
>> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
>> unconditionally, successive appends cannot optimize away memory accesses
>> for ->len/->data.

> With a set of patches doing so, int4send itself is not a significant
> factor for my test benchmark [1] anymore.

This thread seems to have died out, possibly because the last set of
patches that Andres posted was sufficiently complicated and invasive
that nobody wanted to review it.  I thought about this again after
seeing that [1] is mostly about pq_begintypsend overhead, and had
an epiphany: there isn't really a strong reason for pq_begintypsend
to be inserting bits into the buffer at all.  The bytes will be
filled by pq_endtypsend, and nothing in between should be touching
them.  So I propose 0001 attached.  It's poking into the stringinfo
abstraction a bit more than I would want to do if there weren't a
compelling performance reason to do so, but there evidently is.

With 0001, pq_begintypsend drops from being the top single routine
in a profile of a test case like [1] to being well down the list.

The next biggest cost compared to text-format output is that
printtup() itself is noticeably more expensive.  A lot of the extra
cost there seems to be from pq_sendint32(), which is getting inlined
into printtup(), and there probably isn't much we can do to make that
cheaper.  But eliminating a common subexpression as in 0002 below does
help noticeably, at least with the rather old gcc I'm using.

For me, the combination of these two eliminates most but not quite
all of the cost penalty of binary over text output as seen in [1].

			regards, tom lane

[1] https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com

diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index a6f990c..03b7404 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -328,11 +328,16 @@ void
 pq_begintypsend(StringInfo buf)
 {
 	initStringInfo(buf);
-	/* Reserve four bytes for the bytea length word */
-	appendStringInfoCharMacro(buf, '\0');
-	appendStringInfoCharMacro(buf, '\0');
-	appendStringInfoCharMacro(buf, '\0');
-	appendStringInfoCharMacro(buf, '\0');
+
+	/*
+	 * Reserve four bytes for the bytea length word.  We don't need to fill
+	 * them with anything (pq_endtypsend will do that), and this function is
+	 * enough of a hot spot that it's worth cheating to save some cycles.
+	 * Note in particular that we don't bother to guarantee that the buffer
+	 * is null-terminated.
+	 */
+	Assert(buf->maxlen > 4);
+	buf->len = 4;
 }
 
 /* --------------------------------
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index dd1bac0..a9315c6 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -438,11 +438,12 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 		{
 			/* Binary output */
 			bytea	   *outputbytes;
+			int			outputlen;
 
 			outputbytes = SendFunctionCall(&thisState->finfo, attr);
-			pq_sendint32(buf, VARSIZE(outputbytes) - VARHDRSZ);
-			pq_sendbytes(buf, VARDATA(outputbytes),
-						 VARSIZE(outputbytes) - VARHDRSZ);
+			outputlen = VARSIZE(outputbytes) - VARHDRSZ;
+			pq_sendint32(buf, outputlen);
+			pq_sendbytes(buf, VARDATA(outputbytes), outputlen);
 		}
 	}
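(For context on why the four bytes can be left uninitialized: pq_endtypsend overwrites them with the real length. Its body is roughly the following -- quoted from memory, so treat it as a sketch:)

    bytea *
    pq_endtypsend(StringInfo buf)
    {
        bytea	   *result = (bytea *) buf->data;

        /* Insert correct length into bytea length word */
        Assert(buf->len >= VARHDRSZ);
        SET_VARSIZE(result, buf->len);

        return result;
    }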
On Mon, May 18, 2020 at 13:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.  I thought about this again after
> seeing that [1] is mostly about pq_begintypsend overhead, and had
> an epiphany: there isn't really a strong reason for pq_begintypsend
> to be inserting bits into the buffer at all.  The bytes will be
> filled by pq_endtypsend, and nothing in between should be touching
> them.  So I propose 0001 attached.  It's poking into the stringinfo
> abstraction a bit more than I would want to do if there weren't a
> compelling performance reason to do so, but there evidently is.
>
> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
>
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive.  A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper.  But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.
Again, I see problems with the types declared in Postgres.
1. pq_sendint32 (StringInfo buf, uint32 i)
2. extern void pq_sendbytes (StringInfo buf, const char * data, int datalen);
Wouldn't it be better to declare outputlen (0002) as uint32?
To avoid converting from (int) to (uint32), even if afterwards there is a conversion from (uint32) to (int)?
regards,
Ranier Vilela
Ranier Vilela <ranier.vf@gmail.com> writes:
> Again, I see problems with the types declared in Postgres.
> 1. pq_sendint32 (StringInfo buf, uint32 i)
> 2. extern void pq_sendbytes (StringInfo buf, const char * data, int
> datalen);

We could spend the next ten years cleaning up minor discrepancies like
that, and have nothing much to show for the work.

> To avoid converting from (int) to (uint32), even if afterwards there is a
> conversion from (uint32) to (int)?

You do realize that that conversion costs nothing?

			regards, tom lane
Hi,

On 2020-05-18 12:38:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.

Well, I wasn't really planning to try to get that patchset into 13,
and it wasn't in the CF...

> I thought about this again after seeing that [1] is mostly about
> pq_begintypsend overhead

I'm not really convinced that that's the whole problem. Using the
benchmark from upthread, I get (median of three):

master: 1181.581
yours:  1171.445
mine:    598.031

That's a very significant difference, imo. It helps a bit with the
benchmark from your [1], but not that much.

> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
>
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive.  A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper.  But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.

I think there's plenty more we can do: First, it's unnecessary to
re-initialize a FunctionCallInfo for every send/recv/out/in call.
Instead we can reuse the same one over and over.

After that, the biggest remaining overhead for Jack's test is the
palloc for the stringinfo, as far as I can see. I've complained about
that before...

I've just hacked up a modification where, for send functions,
fcinfo->context contains a stringinfo set up by printtup/CopyTo. That,
combined with using a FunctionCallInfo set up beforehand, instead of
re-initializing it in every printtup cycle, results in a pretty good
saving. Making the binary protocol 20% faster than text, in Jack's
testcase. And my lotsaints4 test goes further down to ~410ms (this is
2.5x faster than where we started).

Now obviously, the hack with passing a StringInfo in ->context is just
that, a hack. A somewhat gross one even. But I think it pretty clearly
shows the problem and the way out.

I don't know what the best non-gross solution for the overhead of the
out/send functions is. There seem to be at least the following major
options (and lots of variants thereof):

1) Just continue to incur significant overhead for every datum

2) Accept the ugliness of passing in a buffer via
   FunctionCallInfo->context. Change nearly all in-core send functions
   over to that.

3) Pass the string buffer through a new INTERNAL argument to send/output
   functions, allow both old/new style send functions. Use a wrapper
   function to adapt the "old style" to the "new style".

4) Like 3, but make the new argument optional, and use an ad-hoc
   stringbuffer if not provided. I don't like the unnecessary branches
   this adds.

The biggest problem after that is that we waste a lot of time memcpying
stuff around repeatedly. There is:

1) send function: datum -> per datum stringinfo
2) printtup: per datum stringinfo -> per row stringinfo
3) socket_putmessage: per row stringinfo -> PqSendBuffer
4) send(): PqSendBuffer -> kernel buffer

It's obviously hard to avoid 1) and 4) in the common case, but the
number of other copies seems pretty clearly excessive. If we change the
signature of the out/send function to always target a string buffer, we
could pretty easily avoid 2), and for out functions we'd not have to
redundantly call strlen (as the argument to pq_sendcountedtext)
anymore, which seems substantial too.

As I argued before, I think it's unnecessary to have a separate buffer
between 3-4). We should construct the outgoing message inside the send
buffer. I still don't understand what "recursion" danger there is,
nothing below printtup should ever send protocol messages, no?

Sometimes there's also 0) in the above, when detoasting a datum...

Greetings,

Andres Freund
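(A sketch of the FunctionCallInfo-reuse idea: set it up once per COPY or printtup run and only swap the argument per datum. This assumes the PG12+ fcinfo layout; the surrounding variable names are illustrative:)

    /* once, at scan setup */
    LOCAL_FCINFO(fcinfo, 1);
    InitFunctionCallInfoData(*fcinfo, &thisState->finfo, 1,
                             InvalidOid, NULL, NULL);

    /* then, per datum, only the argument changes */
    fcinfo->args[0].value = attr;
    fcinfo->args[0].isnull = false;
    fcinfo->isnull = false;		/* must be reset between calls */
    result = FunctionCallInvoke(fcinfo);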
Attachment
- v2-0001-stringinfo-Move-more-functions-inline-provide-ini.patch
- v2-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch
- v2-0003-pqformat-Move-functions-for-type-sending-inline-a.patch
- v2-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch
- v2-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch
- v2-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch
- v2-0007-wip-make-send-recv-calls-in-printtup.c-copy.c-che.patch
- v2-0008-inline-pq_sendbytes-stop-maintaining-trailing-nul.patch
- v2-0009-heavy-wip-Allow-string-buffer-reuse-in-send-funct.patch
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <andres@anarazel.de> wrote:
> The biggest problem after that is that we waste a lot of time memcpying
> stuff around repeatedly. There is:
> 1) send function: datum -> per datum stringinfo
> 2) printtup: per datum stringinfo -> per row stringinfo
> 3) socket_putmessage: per row stringinfo -> PqSendBuffer
> 4) send(): PqSendBuffer -> kernel buffer
>
> It's obviously hard to avoid 1) and 4) in the common case, but the
> number of other copies seems pretty clearly excessive.

I too have seen recent benchmarking data where this was a big problem.
Basically, you need a workload where the server doesn't have much or
any actual query processing to do, but is just returning a lot of
stuff to a really fast client - e.g. a locally connected client.
That's not necessarily the most common case but, if you have it, all
this extra copying is really pretty expensive.

My first thought was to wonder about changing all of our send/output
functions to write into a buffer passed as an argument rather than
returning something which we then have to copy into a different
buffer, but that would be a somewhat painful change, so it is probably
better to first pursue the idea of getting rid of some of the other
copies that happen in more centralized places (e.g. printtup).

I wonder if we could replace the whole
pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
something a bit better-designed. For instance, suppose we get rid of
the idea that the caller supplies the buffer, and we move the
responsibility for error recovery into the pqcomm layer. So you do
something like:

my_message = xyz_beginmessage('D');
xyz_sendint32(my_message, 42);
xyz_endmessage(my_message);

Maybe what happens here under the hood is we keep a pool of free
message buffers sitting around, and you just grab one and put your
data into it. When you end the message we add it to a list of used
message buffers that are waiting to be sent, and once we send the data
it goes back on the free list. If an error occurs after
xyz_beginmessage() and before xyz_endmessage(), we put the buffer back
on the free list. That would allow us to merge (2) and (3) into a
single copy. To go further, we could allow send/output functions to
opt in to receiving a message buffer rather than returning a value,
and then we could get rid of (1) for types that participate. (4) seems
unavoidable AFAIK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-06-03 11:30:42 -0400, Robert Haas wrote:
> I too have seen recent benchmarking data where this was a big problem.
> Basically, you need a workload where the server doesn't have much or
> any actual query processing to do, but is just returning a lot of
> stuff to a really fast client - e.g. a locally connected client.
> That's not necessarily the most common case but, if you have it, all
> this extra copying is really pretty expensive.

Even when the query actually is doing something, it's still quite
possible to get the memcpies to be measurable (say > 10% of cycles).
Obviously not in a huge aggregating query. Even in something like
pgbench -M prepared -S, which is obviously spending most of its cycles
elsewhere, the patches upthread improve throughput by ~1.5% (and
that's not eliding several unnecessary copies).

> My first thought was to wonder about changing all of our send/output
> functions to write into a buffer passed as an argument rather than
> returning something which we then have to copy into a different
> buffer, but that would be a somewhat painful change, so it is probably
> better to first pursue the idea of getting rid of some of the other
> copies that happen in more centralized places (e.g. printtup).

For those I think the allocator overhead is the bigger issue than the
memcpy itself. I wonder how much we could transparently hide in
pq_begintypsend()/pq_endtypsend().

> I wonder if we could replace the whole
> pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
> something a bit better-designed. For instance, suppose we get rid of
> the idea that the caller supplies the buffer, and we move the
> responsibility for error recovery into the pqcomm layer. So you do
> something like:
>
> my_message = xyz_beginmessage('D');
> xyz_sendint32(my_message, 42);
> xyz_endmessage(my_message);
>
> Maybe what happens here under the hood is we keep a pool of free
> message buffers sitting around, and you just grab one and put your
> data into it.

Why do we need multiple buffers? ISTM we don't want to just send
messages at endmsg() time, because that implies unnecessary syscall
overhead. Nor do we want to imply the overhead of the copy from the
message buffer to the network buffer.

To me that seems to imply that the best approach would be to have
PqSendBuffer be something stringbuffer like, and have pq_beginmessage()
record the starting position of the current message somewhere
(->cursor?). When an error is thrown, we reset the position to be
where the in-progress message would have begun.

I've previously outlined a slightly more complicated scheme, where we
have "proxy" stringinfos that point into another stringinfo, instead
of their own buffer. And know how to resize the "outer" buffer when
needed. That'd have some advantages, but I'm not sure it's really
needed.

There's some disadvantages with what I describe above, in particular
when dealing with send() sending only parts of our network buffer. We
couldn't cheaply reuse the already sent memory in that case.

I've before wondered / suggested that we should have StringInfos not
insist on having one consecutive buffer (which obviously implies
needing to copy contents when growing). Instead it should have a list
of buffers containing chunks of the data, and never copy contents
around while the string is being built. We'd only allocate a buffer
big enough for all data when the caller actually wants to have all the
resulting data in one string (rather than using an API that can
iterate over chunks).

For the network buffer case that'd allow us to reuse the earlier
buffers even in the "partial send" case. And more generally it'd allow
us to be less wasteful with buffer sizes, and perhaps even have a
small "inline" buffer inside StringInfoData, avoiding unnecessary
memory allocations in a lot of cases where the common case is only a
small amount of data being used. And I think the overhead while
appending data to such a stringinfo should be negligible, because it'd
just require the exact same checks we already have to do for
enlargeStringInfo().

> (4) seems unavoidable AFAIK.

Not entirely. Linux can do zero-copy sends, but it does require
somewhat complicated black magic rituals. Including more complex
buffer management for the application, because the memory containing
the to-be-sent data cannot be reused until the kernel notifies that
it's done with the buffer. See
https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html

That might be something worth pursuing in the future (since it, I
think, basically avoids spending any cpu cycles on copying data around
in the happy path, relying on DMA instead), but I think for now
there's much bigger fish to fry.

I am hoping that somebody will write a nicer abstraction for zero-copy
sends using io_uring, avoiding the need for a separate completion
queue, by simply only signalling completion for the sendmsg operation
once the buffer isn't needed anymore. There's no corresponding
completion logic for normal sendmsg() calls, so it makes sense that
something had to be invented before something like io_uring existed.

Greetings,

Andres Freund
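(To make the chunked-StringInfo idea concrete, a minimal sketch of the data structure; this is entirely hypothetical, nothing like it exists in the tree:)

    typedef struct StringInfoChunk
    {
        struct StringInfoChunk *next;	/* next chunk in append order */
        int			len;				/* bytes used in data[] */
        int			maxlen;				/* bytes allocated in data[] */
        char		data[FLEXIBLE_ARRAY_MEMBER];
    } StringInfoChunk;

    typedef struct ChunkedStringInfo
    {
        StringInfoChunk *head;			/* first chunk */
        StringInfoChunk *tail;			/* chunk currently being filled */
        int			total_len;			/* sum of all chunks' len */
    } ChunkedStringInfo;

Appends go into tail until it fills; then a new (typically larger) chunk is linked on instead of repalloc-and-memcpy, and flattening into one contiguous buffer happens only when a caller actually demands it.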
On Wed, Jun 3, 2020 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> Why do we need multiple buffers? ISTM we don't want to just send
> messages at endmsg() time, because that implies unnecessary syscall
> overhead. Nor do we want to imply the overhead of the copy from the
> message buffer to the network buffer.

It would only matter if there are multiple messages being constructed
at the same time, and that's probably not common, but maybe there's
some way it can happen. It doesn't seem like it really costs anything
to allow for it, and it might be useful sometime. For instance,
consider your idea of using Linux black magic to do zero-copy sends.
Now you either need multiple buffers, or you need one big buffer that
you can recycle a bit at a time.

> To me that seems to imply that the best approach would be to have
> PqSendBuffer be something stringbuffer like, and have pq_beginmessage()
> record the starting position of the current message somewhere
> (->cursor?). When an error is thrown, we reset the position to be
> where the in-progress message would have begun.

Yeah, I thought about that, but then how do you detect the case where
two different people try to undertake message construction at the same
time? Like, with the idea I was proposing, you could still decide to
limit yourself to 1 buffer at the same time, and just elog() if
someone tries to allocate a second buffer when you've already reached
the maximum number of allocated buffers (i.e. one). But if you just
have one buffer in a global variable and everybody writes into it, you
might not notice if some unrelated code writes data into that buffer
in the middle of someone else's message construction. Doing it the way
I proposed, writing data requires passing a buffer pointer, so you can
be sure that somebody had to get the buffer from somewhere... and any
rules you want to enforce can be enforced at that point.

> I've before wondered / suggested that we should have StringInfos not
> insist on having one consecutive buffer (which obviously implies
> needing to copy contents when growing). Instead it should have a list
> of buffers containing chunks of the data, and never copy contents
> around while the string is being built. We'd only allocate a buffer
> big enough for all data when the caller actually wants to have all the
> resulting data in one string (rather than using an API that can
> iterate over chunks).

It's a thought. I doubt it's worth it for small amounts of data, but
for large amounts it might be. On the other hand, a better idea still
might be to size the buffer correctly from the start...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-06-09 11:46:09 -0400, Robert Haas wrote:
> On Wed, Jun 3, 2020 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> > Why do we need multiple buffers? ISTM we don't want to just send
> > messages at endmsg() time, because that implies unnecessary syscall
> > overhead. Nor do we want to imply the overhead of the copy from the
> > message buffer to the network buffer.
>
> It would only matter if there are multiple messages being constructed
> at the same time, and that's probably not common, but maybe there's
> some way it can happen.

ISTM that it'd be pretty broken if it could happen. We cannot have two
different parts of the system send messages to the client
independently. The protocol is pretty stateful...

> > To me that seems to imply that the best approach would be to have
> > PqSendBuffer be something stringbuffer like, and have pq_beginmessage()
> > record the starting position of the current message somewhere
> > (->cursor?). When an error is thrown, we reset the position to be
> > where the in-progress message would have begun.
>
> Yeah, I thought about that, but then how do you detect the case where
> two different people try to undertake message construction at the same
> time?

Set a boolean and assert out if one already is in progress? We'd need
some state to know where to reset the position to on error anyway.

> Like, with the idea I was proposing, you could still decide to
> limit yourself to 1 buffer at the same time, and just elog() if
> someone tries to allocate a second buffer when you've already reached
> the maximum number of allocated buffers (i.e. one). But if you just
> have one buffer in a global variable and everybody writes into it, you
> might not notice if some unrelated code writes data into that buffer
> in the middle of someone else's message construction. Doing it the way
> I proposed, writing data requires passing a buffer pointer, so you can
> be sure that somebody had to get the buffer from somewhere... and any
> rules you want to enforce can be enforced at that point.

I'd hope that we'd encapsulate the buffer management into file local
variables in pqcomm.c or such, and that code outside of that cannot
access the out buffer directly without using the appropriate helpers.

> > I've before wondered / suggested that we should have StringInfos not
> > insist on having one consecutive buffer (which obviously implies
> > needing to copy contents when growing). Instead it should have a list
> > of buffers containing chunks of the data, and never copy contents
> > around while the string is being built. We'd only allocate a buffer
> > big enough for all data when the caller actually wants to have all the
> > resulting data in one string (rather than using an API that can
> > iterate over chunks).
>
> It's a thought. I doubt it's worth it for small amounts of data, but
> for large amounts it might be. On the other hand, a better idea still
> might be to size the buffer correctly from the start...

I think those are complementary. I do agree that it's useful to size
stringinfos more appropriately immediately (there's an upthread patch
adding a version of initStringInfo() that does so, quite useful for
small stringinfos in particular). But there's enough cases where
that's not really knowable ahead of time that I think it'd be quite
useful to have support for the type of buffer I describe above.

Greetings,

Andres Freund
On Tue, Jun 9, 2020 at 3:23 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that it'd be pretty broken if it could happen. We cannot have two
> different parts of the system send messages to the client
> independently. The protocol is pretty stateful...

There's a difference between building messages concurrently and
sending them concurrently.

> Set a boolean and assert out if one already is in progress? We'd need
> some state to know where to reset the position to on error anyway.

Sure, that's basically just different notation for the same thing. I
might prefer my notation over yours, but you might prefer the reverse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <andres@anarazel.de> wrote:
> I don't know what the best non-gross solution for the overhead of the
> out/send functions is. There seem to be at least the following
> major options (and lots of variants thereof):
>
> 1) Just continue to incur significant overhead for every datum
> 2) Accept the ugliness of passing in a buffer via
>    FunctionCallInfo->context. Change nearly all in-core send functions
>    over to that.
> 3) Pass the string buffer through a new INTERNAL argument to send/output
>    functions, allow both old/new style send functions. Use a wrapper
>    function to adapt the "old style" to the "new style".
> 4) Like 3, but make the new argument optional, and use an ad-hoc
>    stringbuffer if not provided. I don't like the unnecessary branches
>    this adds.

I ran into this problem in another context today while poking at some
pg_basebackup stuff. There's another way of solving this problem which
I think we should consider: just get rid of the per-row stringinfo and
push the bytes directly from wherever they are into PqSendBuffer.

Once we start doing this, we can't error out, because internal_flush()
might've been called, sending a partial message to the client. Trying
to now switch to sending an ErrorResponse will break protocol sync.
But it seems possible to avoid that. Just call all of the output
functions first, and also do any required encoding conversions
(pq_sendcountedtext -> pg_server_to_client). Then, do a bunch of
pq_putbytes() calls to shove the message out -- there's the small
matter of an assertion failure, but whatever. This effectively
collapses two copies into one. Or alternatively, build up an array of
iovecs and then have a variant of pq_putmessage(), like
pq_putmessage_iovec(), that knows what to do with them.

One advantage of this approach is that it approximately doubles the
size of the DataRow message we can send. We're currently limited to
<1GB because of palloc, but the wire protocol just needs it to be <2GB
so that a signed integer does not overflow. It would be nice to buy
more than a factor of two here, but that would require a wire protocol
change, and 2x is not bad.

Another advantage of this approach is that it doesn't require passing
StringInfos all over the place. For the use case that I was looking
at, that appears awkward. I'm not saying I couldn't make it work, but
it wouldn't be my first choice. Right now, I've got data bubbling down
a chain of handlers until it eventually gets sent off to the client;
with your approach, I think I'd need to bubble buffers up and then
bubble data down, which seems quite a bit more complex.

A disadvantage of this approach is that we still end up doing three
copies: one from the datum to the per-datum StringInfo, a second into
PqSendBuffer, and a third from there to the kernel. However, we could
probably improve on this. Whenever we internal_flush(), consider
whether the chunk of data we're in the process of copying (the current
call to pq_putbytes(), or the current iovec) has enough bytes
remaining to completely refill the buffer. If so, secure_write() a
buffer's worth of bytes (or more) directly, bypassing PqSendBuffer.
That way, we avoid individual system calls (or calls to OpenSSL or
GSS) for small numbers of bytes, but we also avoid extra copying when
transmitting larger amounts of data.

Even with that optimization, this still seems like it could end up
being less efficient than your proposal (surprise, surprise). If we've
got a preallocated buffer which we won't be forced to resize during
message construction -- and for DataRow messages we can get there just
by keeping the buffer around, so that we only need to reallocate when
we see a larger message than we've ever seen before -- and we write
all the data directly into that buffer and then send it from there
straight to the kernel, we only ever do 2 copies, whereas what I'm
proposing sometimes does 3 copies and sometimes only 2. While I admit
that's not great, it seems likely to still be a significant win over
what we have now, and it's a *lot* less invasive than your proposal.

Not only does your approach require changing all of the type-output
and type-sending functions inside and outside core to use this new
model, admittedly with the possibility of backward compatibility, but
it also means that we could need similarly invasive changes in any
other place that wants to use this new style of message construction.
You can't write any data anywhere that you might want to later
incorporate into a protocol message unless you write it into a
StringInfo; and not only that, but you have to be able to get the
right amount of data into the right place in the StringInfo right from
the start. I think that in some cases that will require fairly complex
orchestration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
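(A sketch of the interface Robert floats here; pq_putmessage_iovec is hypothetical, modeled on the existing pq_putmessage:)

    #include <sys/uio.h>

    /*
     * Send one protocol message whose body is scattered across iovcnt
     * buffers.  The length word is computed from the iovec lengths, so no
     * per-row StringInfo assembly is needed before transmission.
     */
    extern int pq_putmessage_iovec(char msgtype,
                                   const struct iovec *iov, int iovcnt);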
Here is some review for the first few patches in this series.

I am generally in favor of applying 0001-0003 no matter what else we
decide to do here. However, as might be expected, I have a few
questions and comments.

Regarding 0001:

I dislike the name initStringInfoEx() because we do not use the "Ex"
convention for function names anywhere in the code base. We do
sometimes use "extended", so this could be initStringInfoExtended(),
perhaps, or something more specific, like initStringInfoWithLength().

Regarding the FIXME in that function, I suggest that this should be
the caller's job. Otherwise, there will probably be some caller which
doesn't want the add-one behavior, and then said caller will subtract
one to compensate, and that will be silly.

I am not familiar with pg_restrict and don't entirely understand the
motivation for it here. I suspect you have done some experiments and
figured out that it produces better code, but it would be interesting
to hear more about how you got there. Perhaps there could even be some
brief comments about it. Locutions like this are particularly
confusing to me:

+static inline void
+resetStringInfo(StringInfoData *pg_restrict str)
+{
+	*(char *pg_restrict) (str->data) = '\0';
+	str->len = 0;
+	str->cursor = 0;
+}

I don't understand how this can be saving anything. I think the
restrict definitions here mean that str->data does not overlap with
str itself, but considering that these are unconditional stores, so
what? If the code were doing something like memset(str->data, 0,
str->len) then I'd get it: it might be useful to know for optimization
purposes that the memset isn't overwriting str->len. But what code can
we produce for this that wouldn't be valid if str->data = &str? I
assume this is my lack of understanding rather than an actual problem
with the patch, but I would be grateful if you could explain.

It is easier to see why the pg_restrict stuff you've introduced into
appendBinaryStringInfoNT is potentially helpful: e.g. in
appendBinaryStringInfoNT, it promises that memcpy can't clobber
str->len, so the compiler is free to reorder without changing the
results. Or so I imagine. But then the one in appendBinaryStringInfo()
confuses me again: if str->data is already declared as a restricted
pointer, then why do we need to cast str->data + str->len to be
restricted also?

In appendStringInfoChar, why do we need to cast to restrict twice? Can
we not just do something like this:

char *pg_restrict ep = str->data+str->len;
ep[0] = ch;
ep[1] = '\0';

Regarding 0002:

Totally mechanical, seems fine.

Regarding 0003:

For the same reasons as above, I suggest renaming pq_begintypsend_ex()
to pq_begintypsend_extended() or pq_begintypsend_with_length() or
something of that sort, rather than using _ex.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-07-31 11:14:46 -0400, Robert Haas wrote:
> Here is some review for the first few patches in this series.

Thanks!

> I am generally in favor of applying 0001-0003 no matter what else we
> decide to do here. However, as might be expected, I have a few
> questions and comments.
>
> Regarding 0001:
>
> I dislike the name initStringInfoEx() because we do not use the "Ex"
> convention for function names anywhere in the code base. We do
> sometimes use "extended", so this could be initStringInfoExtended(),
> perhaps, or something more specific, like initStringInfoWithLength().

I dislike the length of the function name, but ...

> Regarding the FIXME in that function, I suggest that this should be
> the caller's job. Otherwise, there will probably be some caller which
> doesn't want the add-one behavior, and then said caller will subtract
> one to compensate, and that will be silly.

Fair point.

> I am not familiar with pg_restrict and don't entirely understand the
> motivation for it here. I suspect you have done some experiments and
> figured out that it produces better code, but it would be interesting
> to hear more about how you got there. Perhaps there could even be some
> brief comments about it. Locutions like this are particularly
> confusing to me:
>
> +static inline void
> +resetStringInfo(StringInfoData *pg_restrict str)
> +{
> +	*(char *pg_restrict) (str->data) = '\0';
> +	str->len = 0;
> +	str->cursor = 0;
> +}

The restrict tells the compiler that 'str' and 'str->data' won't be pointing to the same memory, which can simplify the code it's generating. E.g. it'll allow the compiler to keep str->data in a register, instead of reloading it for the next appendStringInfo*. Without the restrict it can't, because the '\0' store through str->data could otherwise overwrite parts of the value of ->data itself, if ->data pointed into the StringInfo. Similarly, str->data could be overwritten by the store to str->len in some other cases.

Part of the reason we need to add the markers is that we compile with -fno-strict-aliasing. But even if we didn't, this particular case wouldn't be solved without an explicit marker, because char * is allowed to alias anything...

Besides keeping ->data in a register, the restrict can also let the compiler entirely elide the null byte write in some cases, e.g. because the resetStringInfo() is followed by an appendStringInfoString(str, "constant").

> I don't understand how this can be saving anything. I think the
> restrict definitions here mean that str->data does not overlap with
> str itself, but considering that these are unconditional stores, so
> what? If the code were doing something like memset(str->data, 0,
> str->len) then I'd get it: it might be useful to know for optimization
> purposes that the memset isn't overwriting str->len. But what code can
> we produce for this that wouldn't be valid if str->data = &str? I
> assume this is my lack of understanding rather than an actual problem
> with the patch, but I would be grateful if you could explain.

I hope the above makes this make sense now? It's about subsequent uses of the StringInfo, rather than the body of resetStringInfo itself.

> It is easier to see why the pg_restrict stuff you've introduced into
> appendBinaryStringInfoNT is potentially helpful: it promises that the
> memcpy can't clobber str->len, so the compiler is free to reorder
> without changing the results. Or so I imagine.
> But then the one in appendBinaryStringInfo() confuses me again: if
> str->data is already declared as a restricted pointer, then why do we
> need to cast str->data + str->len to be restricted also?

But str->data isn't declared restricted without the explicit use of restrict? str is restrict'ed, but that doesn't apply "recursively" to all pointers contained therein.

> In appendStringInfoChar, why do we need to cast to restrict twice? Can
> we not just do something like this:
>
>     char *pg_restrict ep = str->data + str->len;
>     ep[0] = ch;
>     ep[1] = '\0';

I don't think that'd tell the compiler that this couldn't overlap with str itself? A single 'restrict' can never (?) help, you need *two* things that are marked as not overlapping in any way.

Greetings,

Andres Freund
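[A small self-contained illustration of the aliasing question being discussed. The struct mirrors StringInfoData's layout, the function names are invented, and plain C99 restrict stands in for pg_restrict:]

typedef struct StringInfoData
{
	char	   *data;
	int			len;
	int			maxlen;
} StringInfoData;

/*
 * Without restrict, the compiler must assume the char store through
 * str->data might modify str->len or str->data itself, so it reloads
 * both fields before the second store.
 */
static void
append_two_plain(StringInfoData *str, char a, char b)
{
	str->data[str->len++] = a;
	str->data[str->len++] = b;	/* forces a reload of str->data */
}

/*
 * With both pointers restrict-qualified, the compiler may keep
 * str->data in a register and even merge the two byte stores.
 */
static void
append_two_restrict(StringInfoData *restrict str, char a, char b)
{
	char *restrict p = str->data + str->len;

	p[0] = a;
	p[1] = b;
	str->len += 2;
}

[Comparing the two at -O2, e.g. with gcc -S, is an easy way to watch the reload of str->data disappear in the restrict-qualified version, assuming the compiler takes advantage of the qualifier.]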
On Fri, Jul 31, 2020 at 11:50 AM Andres Freund <andres@anarazel.de> wrote:
> I hope the above makes this make sense now? It's about subsequent uses of
> the StringInfo, rather than the body of resetStringInfo itself.

That does make sense, except that https://en.cppreference.com/w/c/language/restrict says "During each execution of a block in which a restricted pointer P is declared (typically each execution of a function body in which P is a function parameter), if some object that is accessible through P (directly or indirectly) is modified, by any means, then all accesses to that object (both reads and writes) in that block must occur through P (directly or indirectly), otherwise the behavior is undefined." So my interpretation of this was that it couldn't really affect what happened outside of the function itself, even if the compiler chose to perform inlining.

But I think I see what you're saying: the *inference* is only valid with respect to restrict pointers in a particular function, but what can be optimized as a result of that inference may be something further afield, if inlining is performed. Perhaps we could add a comment about this, e.g.

    Marking these pointers with pg_restrict tells the compiler that str
    and str->data can't overlap, which may allow the compiler to optimize
    better when this code is inlined. For example, it may be possible to
    keep str->data in a register across consecutive appendStringInfoString
    operations.

Since pg_restrict is not widely used, I think it's worth adding this kind of annotation, lest other hackers get confused. I'm probably not the only one who isn't on top of this.

> > In appendStringInfoChar, why do we need to cast to restrict twice? Can
> > we not just do something like this:
> >
> >     char *pg_restrict ep = str->data + str->len;
> >     ep[0] = ch;
> >     ep[1] = '\0';
>
> I don't think that'd tell the compiler that this couldn't overlap with
> str itself? A single 'restrict' can never (?) help, you need *two*
> things that are marked as not overlapping in any way.

But what's the difference between:

+	*(char *pg_restrict) (str->data + str->len) = ch;
+	str->len++;
+	*(char *pg_restrict) (str->data + str->len) = '\0';

And:

    char *pg_restrict ep = str->data + str->len;
    ep[0] = ch;
    ep[1] = '\0';
    ++str->len;

Whether or not str itself is marked restricted is another question; what I'm talking about is why we need to repeat (char *pg_restrict) (str->data + str->len). I don't have any further comment on the remainder of your reply.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-07-31 12:28:04 -0400, Robert Haas wrote:
> On Fri, Jul 31, 2020 at 11:50 AM Andres Freund <andres@anarazel.de> wrote:
> > I hope the above makes this make sense now? It's about subsequent uses of
> > the StringInfo, rather than the body of resetStringInfo itself.
>
> That does make sense, except that
> https://en.cppreference.com/w/c/language/restrict says "During each
> execution of a block in which a restricted pointer P is declared
> (typically each execution of a function body in which P is a function
> parameter), if some object that is accessible through P (directly or
> indirectly) is modified, by any means, then all accesses to that
> object (both reads and writes) in that block must occur through P
> (directly or indirectly), otherwise the behavior is undefined." So my
> interpretation of this was that it couldn't really affect what
> happened outside of the function itself, even if the compiler chose to
> perform inlining. But I think I see what you're saying: the
> *inference* is only valid with respect to restrict pointers in a
> particular function, but what can be optimized as a result of that
> inference may be something further afield, if inlining is performed.

Right. There are two aspects:

1) By looking at the function, with the restrict, the compiler can infer more about the behaviour of the function, e.g. knowing that ->len has a specific value, or that ->data[n] does. That information can then be used together with subsequent operations, e.g. avoiding a re-read of ->len. That could work in some cases even if subsequent operations were *not* marked up with restrict.

2) The restrict signals to the compiler that we guarantee (i.e. it would be undefined behaviour if not) that the pointers do not overlap. Which means it can assume that in some of the calling code as well, if it can analyze that ->data isn't changed, for example.

> Perhaps we could add a comment about this, e.g.
>
>     Marking these pointers with pg_restrict tells the compiler that str
>     and str->data can't overlap, which may allow the compiler to optimize
>     better when this code is inlined. For example, it may be possible to
>     keep str->data in a register across consecutive appendStringInfoString
>     operations.
>
> Since pg_restrict is not widely used, I think it's worth adding this
> kind of annotation, lest other hackers get confused. I'm probably not
> the only one who isn't on top of this.

Would it make more sense to have a bit of an explanation at pg_restrict's definition, instead of having it at (eventually) multiple places?

> > > In appendStringInfoChar, why do we need to cast to restrict twice? Can
> > > we not just do something like this:
> > >
> > >     char *pg_restrict ep = str->data + str->len;
> > >     ep[0] = ch;
> > >     ep[1] = '\0';
> >
> > I don't think that'd tell the compiler that this couldn't overlap with
> > str itself? A single 'restrict' can never (?) help, you need *two*
> > things that are marked as not overlapping in any way.
>
> But what's the difference between:
>
> +	*(char *pg_restrict) (str->data + str->len) = ch;
> +	str->len++;
> +	*(char *pg_restrict) (str->data + str->len) = '\0';
>
> And:
>
>     char *pg_restrict ep = str->data + str->len;
>     ep[0] = ch;
>     ep[1] = '\0';
>     ++str->len;
>
> Whether or not str itself is marked restricted is another question;
> what I'm talking about is why we need to repeat (char *pg_restrict)
> (str->data + str->len).

Ah, I misunderstood. Yea, there's no reason not to do that.

Greetings,

Andres Freund
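[For reference, a self-contained version of the formulation the two converge on here, reusing the StringInfoData sketch from above. As in the real appendStringInfoChar, it assumes the caller has already ensured room for two more bytes; this is an illustration, not the committed code:]

static inline void
appendStringInfoCharFast(StringInfoData *restrict str, char ch)
{
	/*
	 * One restrict-qualified pointer covers both byte stores, telling
	 * the compiler they cannot alias str->len or str itself.
	 */
	char *restrict ep = str->data + str->len;

	ep[0] = ch;
	ep[1] = '\0';		/* keep the buffer NUL-terminated */
	str->len++;
}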
On Fri, Jul 31, 2020 at 1:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Perhaps we could add a comment about this, e.g.
> > Marking these pointers with pg_restrict tells the compiler that str
> > and str->data can't overlap, which may allow the compiler to optimize
> > better when this code is inlined. For example, it may be possible to
> > keep str->data in a register across consecutive appendStringInfoString
> > operations.
> >
> > Since pg_restrict is not widely used, I think it's worth adding this
> > kind of annotation, lest other hackers get confused. I'm probably not
> > the only one who isn't on top of this.
>
> Would it make more sense to have a bit of an explanation at
> pg_restrict's definition, instead of having it at (eventually) multiple
> places?

I think, at least for the first few, it might be better to have a more specific explanation at the point of use, as it may be easier to understand in specific cases than in general. I imagine this only really makes sense for places that are pretty hot.

> Ah, I misunderstood. Yea, there's no reason not to do that.

OK, then I vote for that version, as I think it looks nicer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I was reminded of the patchset I had posted in this thread by https://postgr.es/m/679d5455cbbb0af667ccb753da51a475bae1eaed.camel%40cybertec.at

On 2020-07-31 13:35:43 -0400, Robert Haas wrote:
> On Fri, Jul 31, 2020 at 1:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > Perhaps we could add a comment about this, e.g.
> > > Marking these pointers with pg_restrict tells the compiler that str
> > > and str->data can't overlap, which may allow the compiler to optimize
> > > better when this code is inlined. For example, it may be possible to
> > > keep str->data in a register across consecutive appendStringInfoString
> > > operations.
> > >
> > > Since pg_restrict is not widely used, I think it's worth adding this
> > > kind of annotation, lest other hackers get confused. I'm probably not
> > > the only one who isn't on top of this.
> >
> > Would it make more sense to have a bit of an explanation at
> > pg_restrict's definition, instead of having it at (eventually) multiple
> > places?
>
> I think, at least for the first few, it might be better to have a more
> specific explanation at the point of use, as it may be easier to
> understand in specific cases than in general. I imagine this only
> really makes sense for places that are pretty hot.

Whenever I looked at adding these comments, it felt wrong. We end up with repetitive boilerplate stuff, as quite a few functions use it. I've thus not addressed this aspect in the attached rebased version. Perhaps a compromise would be to add such a comment to the top of stringinfo.h?

> > Ah, I misunderstood. Yea, there's no reason not to do that.
>
> OK, then I vote for that version, as I think it looks nicer.

Done.

Greetings,

Andres Freund
Attachment
- v3-0001-stringinfo-Move-more-functions-inline-provide-ini.patch
- v3-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch
- v3-0003-WIP-Annotate-palloc-with-malloc-and-other-compile.patch
- v3-0004-pqformat-Move-type-sending-functions-inline-add-p.patch
- v3-0005-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch
- v3-0006-Use-pq_begintypsend_with_size-in-a-number-of-send.patch
- v3-0007-wip-make-send-calls-in-printtup.c-cheaper.patch
- v3-0008-wip-make-in-out-send-recv-calls-in-copy.c-cheaper.patch
- v3-0009-inline-pq_sendbytes-stop-maintaining-trailing-nul.patch
- v3-0010-heavy-wip-Allow-string-buffer-reuse-in-send-funct.patch
Hi,

In <20240218015955.rmw5mcmobt5hbene@awork3.anarazel.de> "Re: Why is pq_begintypsend so slow?" on Sat, 17 Feb 2024 17:59:55 -0800, Andres Freund <andres@anarazel.de> wrote:

> v3-0008-wip-make-in-out-send-recv-calls-in-copy.c-cheaper.patch

It seems that this is an alternative approach of [1].

[1] https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995

+typedef struct CopyInAttributeInfo
+{
+	AttrNumber	num;
+	const char *name;
+
+	Oid			typioparam;
+	int32		typmod;
+
+	FmgrInfo	in_finfo;
+	union
+	{
+		FunctionCallInfoBaseData fcinfo;
+		char		fcinfo_data[SizeForFunctionCallInfo(3)];
+	}			in_fcinfo;

Do we need one FunctionCallInfoBaseData for each attribute? How about sharing one FunctionCallInfoBaseData by all attributes like [1]?

@@ -956,20 +956,47 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,

 					values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
 				}
-
-				/*
-				 * If ON_ERROR is specified with IGNORE, skip rows with soft
-				 * errors
-				 */
-				else if (!InputFunctionCallSafe(&in_functions[m],
-												string,
-												typioparams[m],
-												att->atttypmod,
-												(Node *) cstate->escontext,
-												&values[m]))

Inlining InputFunctionCallSafe() here to use a pre-initialized fcinfo will decrease maintainability, because we abstract the InputFunctionCall family in fmgr.c. If we define an InputFunctionCall variant here, we need to change both fmgr.c and this code whenever the InputFunctionCall family is changed. How about defining the details in fmgr.c and calling them here instead, like [1]?

+			fcinfo->args[0].value = CStringGetDatum(string);
+			fcinfo->args[0].isnull = false;
+			fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
+			fcinfo->args[1].isnull = false;
+			fcinfo->args[2].value = Int32GetDatum(attr->typmod);
+			fcinfo->args[2].isnull = false;

I think that "fcinfo->isnull = false;" is also needed, like [1].

@@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
 	if (fld_size == -1)
 	{
 		*isnull = true;
-		return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
+		return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);

Why isn't the pre-initialized fcinfo used here?

Thanks,
--
kou
Hi,

On 2024-02-18 17:38:09 +0900, Sutou Kouhei wrote:
> In <20240218015955.rmw5mcmobt5hbene@awork3.anarazel.de>
>   "Re: Why is pq_begintypsend so slow?" on Sat, 17 Feb 2024 17:59:55 -0800,
>   Andres Freund <andres@anarazel.de> wrote:
>
> > v3-0008-wip-make-in-out-send-recv-calls-in-copy.c-cheaper.patch
>
> It seems that this is an alternative approach of [1].

Note that what I posted was a very lightly polished rebase of a ~4 year old patchset.

> [1] https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995
>
> +typedef struct CopyInAttributeInfo
> +{
> +	AttrNumber	num;
> +	const char *name;
> +
> +	Oid			typioparam;
> +	int32		typmod;
> +
> +	FmgrInfo	in_finfo;
> +	union
> +	{
> +		FunctionCallInfoBaseData fcinfo;
> +		char		fcinfo_data[SizeForFunctionCallInfo(3)];
> +	}			in_fcinfo;
>
> Do we need one FunctionCallInfoBaseData for each attribute?
> How about sharing one FunctionCallInfoBaseData by all
> attributes like [1]?

That makes no sense to me. You're throwing away most of the possible gains by having to update the FunctionCallInfo fields on every call. You're saving negligible amounts of memory at a substantial runtime cost.

> Inlining InputFunctionCallSafe() here to use a pre-initialized
> fcinfo will decrease maintainability, because we abstract
> the InputFunctionCall family in fmgr.c. If we define an
> InputFunctionCall variant here, we need to change both fmgr.c
> and this code whenever the InputFunctionCall family is changed.
> How about defining the details in fmgr.c and calling them here
> instead, like [1]?

I'm not sure I buy that that is a problem. It's not like my approach was actually bypassing fmgr abstractions altogether - instead it just used the lower level APIs, because it's a performance sensitive area.

> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
>  	if (fld_size == -1)
>  	{
>  		*isnull = true;
> -		return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
> +		return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);
>
> Why isn't the pre-initialized fcinfo used here?

Because it's a prototype and because I don't think it's a common path.

Greetings,

Andres Freund
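[To make the trade-off concrete, here is a rough sketch of the per-attribute pre-initialization being argued for. It uses the real fmgr facilities (InitFunctionCallInfoData, FunctionCallInvoke), but the function names and structure are invented and error handling is omitted -- an illustration of the WIP patch's idea, not its actual code:]

#include "postgres.h"
#include "fmgr.h"

/*
 * One-time setup, done per attribute at the start of COPY FROM.
 * 'fcinfo' must point at SizeForFunctionCallInfo(3) bytes (e.g. the
 * union in the CopyInAttributeInfo struct quoted above), and 'flinfo'
 * must already have been filled in with fmgr_info() for the type's
 * input function.  The constant arguments never change across rows.
 */
static void
setup_input_fcinfo(FunctionCallInfo fcinfo, FmgrInfo *flinfo,
				   fmNodePtr escontext, Oid typioparam, int32 typmod)
{
	InitFunctionCallInfoData(*fcinfo, flinfo, 3, InvalidOid,
							 escontext, NULL);
	fcinfo->args[1].value = ObjectIdGetDatum(typioparam);
	fcinfo->args[1].isnull = false;
	fcinfo->args[2].value = Int32GetDatum(typmod);
	fcinfo->args[2].isnull = false;
}

/*
 * Per row and attribute, only the input string changes; flinfo,
 * context and the type arguments are already in place.
 */
static Datum
call_input_function(FunctionCallInfo fcinfo, char *string)
{
	fcinfo->args[0].value = CStringGetDatum(string);
	fcinfo->args[0].isnull = false;
	return FunctionCallInvoke(fcinfo);
}

[The shared-fcinfo alternative would additionally have to reassign flinfo and context on every call, which is the overhead Andres describes in his later reply.]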
On Sun, Feb 18, 2024 at 12:09:06PM -0800, Andres Freund wrote:
> On 2024-02-18 17:38:09 +0900, Sutou Kouhei wrote:
>> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
>>  	if (fld_size == -1)
>>  	{
>>  		*isnull = true;
>> -		return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
>> +		return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);
>>
>> Why isn't the pre-initialized fcinfo used here?
>
> Because it's a prototype and because I don't think it's a common path.

0008 and 0010 (a bit) are the only patches of the set that touch some of the areas that would be impacted by the refactoring to use callbacks in the COPY code; still, I don't see anything that could not be changed in what's updated here, the one-row callback in COPY FROM being the most touched. So I don't quite see why each effort could not happen on its own? Or, Andres, do you think that any improvements you've been proposing in this area should happen before we consider refactoring the COPY code to plug in the callbacks? I'm a bit confused by the situation, TBH.
--
Michael
Hi,

In <20240218200906.zvihkrs46yl2juzf@awork3.anarazel.de> "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800, Andres Freund <andres@anarazel.de> wrote:

>> [1] https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995

>> Do we need one FunctionCallInfoBaseData for each attribute?
>> How about sharing one FunctionCallInfoBaseData by all
>> attributes like [1]?
>
> That makes no sense to me. You're throwing away most of the possible gains by
> having to update the FunctionCallInfo fields on every call. You're saving
> negligible amounts of memory at a substantial runtime cost.

The number of updated fields in your approach and in [1] is the same:

Your approach: 6 (I think that "fcinfo->isnull = false" is needed though.)

+			fcinfo->args[0].value = CStringGetDatum(string);
+			fcinfo->args[0].isnull = false;
+			fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
+			fcinfo->args[1].isnull = false;
+			fcinfo->args[2].value = Int32GetDatum(attr->typmod);
+			fcinfo->args[2].isnull = false;

[1]: 6 (including "fcinfo->isnull = false")

+	fcinfo->flinfo = flinfo;
+	fcinfo->context = escontext;
+	fcinfo->isnull = false;
+	fcinfo->args[0].value = CStringGetDatum(str);
+	fcinfo->args[1].value = ObjectIdGetDatum(typioparam);
+	fcinfo->args[2].value = Int32GetDatum(typmod);

>> Inlining InputFunctionCallSafe() here to use a pre-initialized
>> fcinfo will decrease maintainability, because we abstract
>> the InputFunctionCall family in fmgr.c. If we define an
>> InputFunctionCall variant here, we need to change both fmgr.c
>> and this code whenever the InputFunctionCall family is changed.
>> How about defining the details in fmgr.c and calling them here
>> instead, like [1]?
>
> I'm not sure I buy that that is a problem. It's not like my approach was
> actually bypassing fmgr abstractions altogether - instead it just used the
> lower level APIs, because it's a performance sensitive area.

[1] provides some optimized abstractions, which are implemented with the lower level APIs, without breaking the abstractions.

Note that I don't have a strong opinion on how to implement this optimization. If other developers think this approach makes sense for this optimization, I don't object to it.

>> @@ -1966,7 +1992,7 @@ CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
>>  	if (fld_size == -1)
>>  	{
>>  		*isnull = true;
>> -		return ReceiveFunctionCall(flinfo, NULL, typioparam, typmod);
>> +		return ReceiveFunctionCall(fcinfo->flinfo, NULL, attr->typioparam, attr->typmod);
>>
>> Why isn't the pre-initialized fcinfo used here?
>
> Because it's a prototype and because I don't think it's a common path.

How about adding a comment on why we don't need to optimize this case?

I don't have a strong opinion on how to implement this optimization, as I said above. It seems that you like your approach, so I withdraw [1]. Could you complete this optimization? Can we proceed with making COPY format extensible without this optimization? It seems that it'll take a little time to complete this optimization because your patch is still WIP. And it seems that you can work on it after making COPY format extensible.

Thanks,
--
kou
Hi,

On 2024-02-19 10:02:52 +0900, Sutou Kouhei wrote:
> In <20240218200906.zvihkrs46yl2juzf@awork3.anarazel.de>
>   "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800,
>   Andres Freund <andres@anarazel.de> wrote:
>
> >> Do we need one FunctionCallInfoBaseData for each attribute?
> >> How about sharing one FunctionCallInfoBaseData by all
> >> attributes like [1]?
> >
> > That makes no sense to me. You're throwing away most of the possible gains by
> > having to update the FunctionCallInfo fields on every call. You're saving
> > negligible amounts of memory at a substantial runtime cost.
>
> The number of updated fields in your approach and in [1] is the same:
>
> Your approach: 6 (I think that "fcinfo->isnull = false" is needed though.)
>
> +			fcinfo->args[0].value = CStringGetDatum(string);
> +			fcinfo->args[0].isnull = false;
> +			fcinfo->args[1].value = ObjectIdGetDatum(attr->typioparam);
> +			fcinfo->args[1].isnull = false;
> +			fcinfo->args[2].value = Int32GetDatum(attr->typmod);
> +			fcinfo->args[2].isnull = false;
>
> [1]: 6 (including "fcinfo->isnull = false")
>
> +	fcinfo->flinfo = flinfo;
> +	fcinfo->context = escontext;
> +	fcinfo->isnull = false;
> +	fcinfo->args[0].value = CStringGetDatum(str);
> +	fcinfo->args[1].value = ObjectIdGetDatum(typioparam);
> +	fcinfo->args[2].value = Int32GetDatum(typmod);

If you want to do so you can elide the isnull assignments in my approach just as well as in yours. Assigning not just the value but also flinfo and context is overhead. But you can't elide assigning flinfo and context, which is why reusing one FunctionCallInfo isn't going to win.

I don't think you necessarily need to assign fcinfo->isnull on every call; these functions aren't allowed to return NULL IIRC. And if they did, we'd error out, so it could only happen once.

> I don't have a strong opinion on how to implement this
> optimization, as I said above. It seems that you like your
> approach, so I withdraw [1]. Could you complete this
> optimization? Can we proceed with making COPY format extensible
> without this optimization? It seems that it'll take a little
> time to complete this optimization because your patch is
> still WIP. And it seems that you can work on it after making
> COPY format extensible.

I don't think optimizing this aspect needs to block making copy extensible.

I don't know how much time/energy I'll have to focus on this in the near term. I really just reposted this because the earlier patches were relevant for the discussion in another thread. If you want to pick the COPY part up, feel free.

Greetings,

Andres Freund
Hi,

In <20240219195351.5vy7cdl3wxia66kg@awork3.anarazel.de> "Re: Why is pq_begintypsend so slow?" on Mon, 19 Feb 2024 11:53:51 -0800, Andres Freund <andres@anarazel.de> wrote:

>> I don't have a strong opinion on how to implement this
>> optimization, as I said above. It seems that you like your
>> approach, so I withdraw [1]. Could you complete this
>> optimization? Can we proceed with making COPY format extensible
>> without this optimization? It seems that it'll take a little
>> time to complete this optimization because your patch is
>> still WIP. And it seems that you can work on it after making
>> COPY format extensible.
>
> I don't think optimizing this aspect needs to block making copy extensible.

OK. I'll work on making copy extensible without this optimization.

> I don't know how much time/energy I'll have to focus on this in the near
> term. I really just reposted this because the earlier patches were relevant
> for the discussion in another thread. If you want to pick the COPY part up,
> feel free.

OK. I may work on this after I complete making copy extensible, if you haven't completed it yet.

Thanks,
--
kou