Thread: Speedup generation of command completion tags

Speedup generation of command completion tags

From
David Rowley
Date:
One of the things I notice which comes up in a profile of a pgbench -S
test is the generation of the command completion tag.

At the moment, this is done by a snprintf() call, namely:

snprintf(completionTag, COMPLETION_TAG_BUFSIZE,
tag == CMDTAG_INSERT ?
"%s 0 " UINT64_FORMAT : "%s " UINT64_FORMAT,
tagname, qc->nprocessed);

Ever since aa2387e2f, and the relevant discussion in [1], it's been
clear that the sprintf() functions are not the fastest. I think
there's just some unavoidable overhead to __VA_ARGS__.

The generation of the completion tag is not hugely dominant in the
profiles, but it does appear:

   0.36%  postgres          [.] dopr.constprop.0

In the attached, there are a few things done to make the generation of
completion tags faster:

Namely:
1. Store the tag length in struct CommandTagBehavior so that we can
memcpy() a fixed length rather than having to copy byte-by-byte
looking for the \0.
2. Use pg_ulltoa_n to write the number of rows affected by the command tag.
3. Have the function that builds the tag return its length so save
from having to do a strlen before writing the tag in pq_putmessage().

It's difficult to measure the performance of something that takes
0.36% of execution. I have previously seen the tag generation take
over 1% of execution time.

One thing that's changed in the patch vs master is that if the
snprintf's buffer, for some reason had not been long enough to store
the entire completion tag, it would have truncated it and perhaps sent
a truncated version of the row count to the client.  For this to
happen, we'd have to have some excessively long command name.  Since
these cannot be added by users, I've opted to just add an Assert that
will trigger if we're ever asked to write a command tag that won't
fit.  Assuming we test our new excessively-long-named-command, we'll
trigger that Assert and realise long before it's a problem.

Both Andres and I seemed to have independently written almost exactly
the same patch for this. The attached is mine but with the
GetCommandTagNameAndLen function from his. I'd written
GetCommandTagLen() for mine, which required a couple of function calls
instead of Andres' 1 function call to get the name and length in one
go.

Does anyone object to this?

David

[1] https://www.postgresql.org/message-id/CAKJS1f8oeW8ZEKqD4X3e+TFwZt+MWV6O-TF8MBpdO4XNNarQvA@mail.gmail.com

Attachment

Re: Speedup generation of command completion tags

From
Andres Freund
Date:
Hi,

On 2022-12-10 20:32:06 +1300, David Rowley wrote:
> @@ -20,13 +20,14 @@
>  typedef struct CommandTagBehavior
>  {
>      const char *name;
> +    const uint8 namelen;

Perhaps worth adding a comment noting that namelen is the length without the
null byte?


> +static inline Size
> +make_completion_tag(char *buff, const QueryCompletion *qc,
> +                    bool force_undecorated_output)
> +{
> +    CommandTag    tag = qc->commandTag;
> +    Size        taglen;
> +    const char *tagname = GetCommandTagNameAndLen(tag, &taglen);
> +    char       *bufp;
> +
> +    /*
> +     * We assume the tagname is plain ASCII and therefore requires no encoding
> +     * conversion.
> +     */
> +    memcpy(buff, tagname, taglen + 1);
> +    bufp = buff + taglen;
> +
> +    /* ensure that the tagname isn't long enough to overrun the buffer */
> +    Assert(taglen <= COMPLETION_TAG_BUFSIZE - MAXINT8LEN - 4);
> +
> +    /*
> +     * In PostgreSQL versions 11 and earlier, it was possible to create a
> +     * table WITH OIDS.  When inserting into such a table, INSERT used to
> +     * include the Oid of the inserted record in the completion tag.  To
> +     * maintain compatibility in the wire protocol, we now write a "0" (for
> +     * InvalidOid) in the location where we once wrote the new record's Oid.
> +     */
> +    if (command_tag_display_rowcount(tag) && !force_undecorated_output)

This does another external function call to cmdtag.c...

What about moving make_completion_tag() to cmdtag.c? Then we could just get
the entire CommandTagBehaviour struct at once. It's not super pretty to pass
QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem
it problematic, we could just pass qc->commandTag, qc->nprocessed as a
separate arguments.

I wonder if any of the other GetCommandTagName() would benefit noticably from
not having to compute the length. I guess the calls
set_ps_display(GetCommandTagName()) calls in exec_simple_query() and
exec_execute_message() might, although set_ps_display() isn't exactly zero
overhead. But I do see it show up as a few percent in profiles, with the
biggest contributor being the call to strlen.

Greetings,

Andres Freund



Re: Speedup generation of command completion tags

From
David Rowley
Date:
Thanks for having a look at this.

On Sun, 11 Dec 2022 at 11:05, Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-12-10 20:32:06 +1300, David Rowley wrote:
> > @@ -20,13 +20,14 @@
> >  typedef struct CommandTagBehavior
> >  {
> >       const char *name;
> > +     const uint8 namelen;
>
> Perhaps worth adding a comment noting that namelen is the length without the
> null byte?

I've added that now plus a few other fields that could be easily
documented. I left out commenting all of the remaining fields as
documenting table_rewrite_ok seemed slightly more than this patch
should be doing.  There's commands that rewrite tables, e.g. VACUUM
FULL that have that as false. Looks like this is only used for
commands that *might* rewrite the table. I didn't want to tackle that
in this patch.

> What about moving make_completion_tag() to cmdtag.c? Then we could just get
> the entire CommandTagBehaviour struct at once. It's not super pretty to pass
> QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem
> it problematic, we could just pass qc->commandTag, qc->nprocessed as a
> separate arguments.

That seems like a good idea. I've renamed and moved the function in
the attached. I also adjusted how the trailing NUL char is appended to
avoid having to calculate len + 1 and append the NUL char twice for
the most commonly taken path.

> I wonder if any of the other GetCommandTagName() would benefit noticably from
> not having to compute the length. I guess the calls
> set_ps_display(GetCommandTagName()) calls in exec_simple_query() and
> exec_execute_message() might, although set_ps_display() isn't exactly zero
> overhead. But I do see it show up as a few percent in profiles, with the
> biggest contributor being the call to strlen.

I think that could be improved for sure.  It does seem like we'd need
to add set_ps_display_with_len() to make what you said work.  There's
probably lower hanging fruit in that function that could be fixed
without having to do that, however.  For example:

strlcpy(ps_buffer + ps_buffer_fixed_size, activity,
    ps_buffer_size - ps_buffer_fixed_size);
ps_buffer_cur_len = strlen(ps_buffer);

could be written as:

strlcpy(ps_buffer + ps_buffer_fixed_size, activity,
    ps_buffer_size - ps_buffer_fixed_size);
ps_buffer_cur_len = ps_buffer_fixed_size + Min(strlen(activity),
ps_buffer_size - ps_buffer_fixed_size - 1);

That's pretty horrible to read though.

This sort of thing also makes me think that our investment in having
more usages of strlcpy() and fewer usages of strncpy was partially a
mistake. There are exactly 2 usages of the return value of strlcpy in
our entire source tree. That's about 1% of all calls.  Likely what
would be better is a function that returns the number of bytes
*actually* copied instead of one that returns the number of bytes that
it would have copied if it hadn't run out of space. Such a function
could be defined as:

size_t
strdcpy(char * const dst, const char *src, ptrdiff_t len)
{
    char *dstp = dst;

    while (len-- > 0)
    {
        if ((*dstp = *src++) == '\0')
        {
            *dstp = '\0';
            break;
        }
        dstp++;
    }
    return (dstp - dst);
}

Then we could append to strings like:

char buffer[STRING_SIZE];
char *bufp = buffer;

bufp += strdcpy(bufp, "01", STRING_SIZE - (bufp - buffer));
bufp += strdcpy(bufp, "23", STRING_SIZE - (bufp - buffer));
bufp += strdcpy(bufp, "45", STRING_SIZE - (bufp - buffer));

which allows transformation of the set_ps_display() code to:

pg_buffer_cur_len = ps_buffer_fixed_size;
pg_buffer_cur_len += strdcpy(ps_buffer + ps_buffer_fixed_size,
activity, ps_buffer_size - ps_buffer_fixed_size);

(Assume the name strdcpy as a placeholder name for an actual name that
does not conflict with something that it's not.)

I'd rather not go into too much detail about that here though. I don't
see any places that can make use of the known tag length without going
to the trouble of inventing new functions or changing the signature of
existing ones.

David

Attachment

Re: Speedup generation of command completion tags

From
David Rowley
Date:
On Mon, 12 Dec 2022 at 14:48, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 11 Dec 2022 at 11:05, Andres Freund <andres@anarazel.de> wrote:
> > What about moving make_completion_tag() to cmdtag.c? Then we could just get
> > the entire CommandTagBehaviour struct at once. It's not super pretty to pass
> > QueryCompletion to a routine in cmdtag.c, but it's not awful. And if we deem
> > it problematic, we could just pass qc->commandTag, qc->nprocessed as a
> > separate arguments.
>
> That seems like a good idea. I've renamed and moved the function in
> the attached. I also adjusted how the trailing NUL char is appended to
> avoid having to calculate len + 1 and append the NUL char twice for
> the most commonly taken path.

I've pushed the updated patch. Thanks for having a look.

David