Thread: Speedup generation of command completion tags
One of the things I notice which comes up in a profile of a pgbench -S test is the generation of the command completion tag. At the moment, this is done by a snprintf() call, namely: snprintf(completionTag, COMPLETION_TAG_BUFSIZE, tag == CMDTAG_INSERT ? "%s 0 " UINT64_FORMAT : "%s " UINT64_FORMAT, tagname, qc->nprocessed); Ever since aa2387e2f, and the relevant discussion in [1], it's been clear that the sprintf() functions are not the fastest. I think there's just some unavoidable overhead to __VA_ARGS__. The generation of the completion tag is not hugely dominant in the profiles, but it does appear: 0.36% postgres [.] dopr.constprop.0 In the attached, there are a few things done to make the generation of completion tags faster: Namely: 1. Store the tag length in struct CommandTagBehavior so that we can memcpy() a fixed length rather than having to copy byte-by-byte looking for the \0. 2. Use pg_ulltoa_n to write the number of rows affected by the command tag. 3. Have the function that builds the tag return its length so save from having to do a strlen before writing the tag in pq_putmessage(). It's difficult to measure the performance of something that takes 0.36% of execution. I have previously seen the tag generation take over 1% of execution time. One thing that's changed in the patch vs master is that if the snprintf's buffer, for some reason had not been long enough to store the entire completion tag, it would have truncated it and perhaps sent a truncated version of the row count to the client. For this to happen, we'd have to have some excessively long command name. Since these cannot be added by users, I've opted to just add an Assert that will trigger if we're ever asked to write a command tag that won't fit. Assuming we test our new excessively-long-named-command, we'll trigger that Assert and realise long before it's a problem. Both Andres and I seemed to have independently written almost exactly the same patch for this. The attached is mine but with the GetCommandTagNameAndLen function from his. I'd written GetCommandTagLen() for mine, which required a couple of function calls instead of Andres' 1 function call to get the name and length in one go. Does anyone object to this? David [1] https://www.postgresql.org/message-id/CAKJS1f8oeW8ZEKqD4X3e+TFwZt+MWV6O-TF8MBpdO4XNNarQvA@mail.gmail.com
Attachment
Hi, On 2022-12-10 20:32:06 +1300, David Rowley wrote: > @@ -20,13 +20,14 @@ > typedef struct CommandTagBehavior > { > const char *name; > + const uint8 namelen; Perhaps worth adding a comment noting that namelen is the length without the null byte? > +static inline Size > +make_completion_tag(char *buff, const QueryCompletion *qc, > + bool force_undecorated_output) > +{ > + CommandTag tag = qc->commandTag; > + Size taglen; > + const char *tagname = GetCommandTagNameAndLen(tag, &taglen); > + char *bufp; > + > + /* > + * We assume the tagname is plain ASCII and therefore requires no encoding > + * conversion. > + */ > + memcpy(buff, tagname, taglen + 1); > + bufp = buff + taglen; > + > + /* ensure that the tagname isn't long enough to overrun the buffer */ > + Assert(taglen <= COMPLETION_TAG_BUFSIZE - MAXINT8LEN - 4); > + > + /* > + * In PostgreSQL versions 11 and earlier, it was possible to create a > + * table WITH OIDS. When inserting into such a table, INSERT used to > + * include the Oid of the inserted record in the completion tag. To > + * maintain compatibility in the wire protocol, we now write a "0" (for > + * InvalidOid) in the location where we once wrote the new record's Oid. > + */ > + if (command_tag_display_rowcount(tag) && !force_undecorated_output) This does another external function call to cmdtag.c... What about moving make_completion_tag() to cmdtag.c? Then we could just get the entire CommandTagBehaviour struct at once. It's not super pretty to pass QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem it problematic, we could just pass qc->commandTag, qc->nprocessed as a separate arguments. I wonder if any of the other GetCommandTagName() would benefit noticably from not having to compute the length. I guess the calls set_ps_display(GetCommandTagName()) calls in exec_simple_query() and exec_execute_message() might, although set_ps_display() isn't exactly zero overhead. But I do see it show up as a few percent in profiles, with the biggest contributor being the call to strlen. Greetings, Andres Freund
Thanks for having a look at this. On Sun, 11 Dec 2022 at 11:05, Andres Freund <andres@anarazel.de> wrote: > > On 2022-12-10 20:32:06 +1300, David Rowley wrote: > > @@ -20,13 +20,14 @@ > > typedef struct CommandTagBehavior > > { > > const char *name; > > + const uint8 namelen; > > Perhaps worth adding a comment noting that namelen is the length without the > null byte? I've added that now plus a few other fields that could be easily documented. I left out commenting all of the remaining fields as documenting table_rewrite_ok seemed slightly more than this patch should be doing. There's commands that rewrite tables, e.g. VACUUM FULL that have that as false. Looks like this is only used for commands that *might* rewrite the table. I didn't want to tackle that in this patch. > What about moving make_completion_tag() to cmdtag.c? Then we could just get > the entire CommandTagBehaviour struct at once. It's not super pretty to pass > QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem > it problematic, we could just pass qc->commandTag, qc->nprocessed as a > separate arguments. That seems like a good idea. I've renamed and moved the function in the attached. I also adjusted how the trailing NUL char is appended to avoid having to calculate len + 1 and append the NUL char twice for the most commonly taken path. > I wonder if any of the other GetCommandTagName() would benefit noticably from > not having to compute the length. I guess the calls > set_ps_display(GetCommandTagName()) calls in exec_simple_query() and > exec_execute_message() might, although set_ps_display() isn't exactly zero > overhead. But I do see it show up as a few percent in profiles, with the > biggest contributor being the call to strlen. I think that could be improved for sure. It does seem like we'd need to add set_ps_display_with_len() to make what you said work. There's probably lower hanging fruit in that function that could be fixed without having to do that, however. For example: strlcpy(ps_buffer + ps_buffer_fixed_size, activity, ps_buffer_size - ps_buffer_fixed_size); ps_buffer_cur_len = strlen(ps_buffer); could be written as: strlcpy(ps_buffer + ps_buffer_fixed_size, activity, ps_buffer_size - ps_buffer_fixed_size); ps_buffer_cur_len = ps_buffer_fixed_size + Min(strlen(activity), ps_buffer_size - ps_buffer_fixed_size - 1); That's pretty horrible to read though. This sort of thing also makes me think that our investment in having more usages of strlcpy() and fewer usages of strncpy was partially a mistake. There are exactly 2 usages of the return value of strlcpy in our entire source tree. That's about 1% of all calls. Likely what would be better is a function that returns the number of bytes *actually* copied instead of one that returns the number of bytes that it would have copied if it hadn't run out of space. Such a function could be defined as: size_t strdcpy(char * const dst, const char *src, ptrdiff_t len) { char *dstp = dst; while (len-- > 0) { if ((*dstp = *src++) == '\0') { *dstp = '\0'; break; } dstp++; } return (dstp - dst); } Then we could append to strings like: char buffer[STRING_SIZE]; char *bufp = buffer; bufp += strdcpy(bufp, "01", STRING_SIZE - (bufp - buffer)); bufp += strdcpy(bufp, "23", STRING_SIZE - (bufp - buffer)); bufp += strdcpy(bufp, "45", STRING_SIZE - (bufp - buffer)); which allows transformation of the set_ps_display() code to: pg_buffer_cur_len = ps_buffer_fixed_size; pg_buffer_cur_len += strdcpy(ps_buffer + ps_buffer_fixed_size, activity, ps_buffer_size - ps_buffer_fixed_size); (Assume the name strdcpy as a placeholder name for an actual name that does not conflict with something that it's not.) I'd rather not go into too much detail about that here though. I don't see any places that can make use of the known tag length without going to the trouble of inventing new functions or changing the signature of existing ones. David
Attachment
On Mon, 12 Dec 2022 at 14:48, David Rowley <dgrowleyml@gmail.com> wrote: > > On Sun, 11 Dec 2022 at 11:05, Andres Freund <andres@anarazel.de> wrote: > > What about moving make_completion_tag() to cmdtag.c? Then we could just get > > the entire CommandTagBehaviour struct at once. It's not super pretty to pass > > QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem > > it problematic, we could just pass qc->commandTag, qc->nprocessed as a > > separate arguments. > > That seems like a good idea. I've renamed and moved the function in > the attached. I also adjusted how the trailing NUL char is appended to > avoid having to calculate len + 1 and append the NUL char twice for > the most commonly taken path. I've pushed the updated patch. Thanks for having a look. David