Avoid stack frame setup in performance critical routines using tail calls - Mailing list pgsql-hackers

From Andres Freund
Subject Avoid stack frame setup in performance critical routines using tail calls
Date
Msg-id 20210719195950.gavgs6ujzmjfaiig@alap3.anarazel.de
Whole thread Raw
Responses Re: Avoid stack frame setup in performance critical routines using tail calls
Re: Avoid stack frame setup in performance critical routines using tail calls
List pgsql-hackers
Hi,

We have a few routines that are taking up a meaningful share of nearly all
workloads. They are worth micro-optimizing, even though they rarely the most
expensive parts of a profile. The memory allocation infrastructure is an
example of that.

When looking at a profile one can often see that a measurable percentage of
the time is spent doing stack frame setup in functions like palloc(),
AllocSetAlloc(). E.g. here's a perf profile annotating palloc(), the first
column showing the percentage of the time the relevant instruction was
sampled:

       │      void *
       │      palloc(Size size)
       │      {
 11.62 │        push  %rbp
  5.89 │        mov   %rsp,%rbp
 11.79 │        push  %r12
       │        mov   %rdi,%r12
  6.07 │        push  %rbx
       │      /* duplicates MemoryContextAlloc to avoid increased overhead */
       │      void       *ret;
       │      MemoryContext context = CurrentMemoryContext;
       │        mov   CurrentMemoryContext,%rbx
       │
       │      AssertArg(MemoryContextIsValid(context));
       │      AssertNotInCriticalSection(context);
       │
       │      if (!AllocSizeIsValid(size))
  5.86 │        cmp   $0x3fffffff,%rdi
       │      → ja    14fa5b <palloc.cold>
       │      elog(ERROR, "invalid memory alloc request size %zu", size);
       │
       │      context->isReset = false;
 17.71 │        movb  $0x0,0x4(%rbx)
       │
       │      ret = context->methods->alloc(context, size);
  5.98 │        mov   0x10(%rbx),%rax
       │        mov   %rdi,%rsi
       │        mov   %rbx,%rdi
 35.08 │      → callq *(%rax)


The stack frame setup bit is the push ... bit.

At least on x86-64 unixoid systems, that overhead can be avoided in certain
circumstances! The simplest case is if the function doesn't do any function
calls of its own. If simple enough (i.e. no register spilling), the compiler
will just not set up a stack frame - nobody could need it.

The slightly more complicated case is that of a function that only does a
"tail call", i.e. the only function call is just before returning (there can
be multiple such paths though). If the function is simple enough, gcc/clang
will then not use the "call" instruction to call the function (which would
require a proper stack frame being set up), but instead just jump to the other
function. Which ends up reusing the current function's stack frame,
basically. When that called function returns using 'ret', it'll reuse the
location pushed onto the call stack by the caller of the "original" function,
and return to its caller. Having optimized away the need to maintain one stack
frame level, and one indirection when returning from the inner function (which
just would do its own ret).

For that to work, there obviously cannot be any instructions in the function
before calling the inner function. Which brings us back to the palloc example
from above.

As an experiment, if i change the code for palloc() to omit the if (ret == NULL)
check, the assembly (omitting source for brevity) from:

  61c9a0:       55                      push   %rbp
  61c9a1:       48 89 e5                mov    %rsp,%rbp
  61c9a4:       41 54                   push   %r12
  61c9a6:       49 89 fc                mov    %rdi,%r12
  61c9a9:       53                      push   %rbx
  61c9aa:       48 8b 1d 2f f2 2a 00    mov    0x2af22f(%rip),%rbx        # 8cbbe0 <CurrentMemoryContext>
  61c9b1:       48 81 ff ff ff ff 3f    cmp    $0x3fffffff,%rdi
  61c9b8:       0f 87 9d 30 b3 ff       ja     14fa5b <palloc.cold>
  61c9be:       c6 43 04 00             movb   $0x0,0x4(%rbx)
  61c9c2:       48 8b 43 10             mov    0x10(%rbx),%rax
  61c9c6:       48 89 fe                mov    %rdi,%rsi
  61c9c9:       48 89 df                mov    %rbx,%rdi
  61c9cc:       ff 10                   callq  *(%rax)
  61c9ce:       48 85 c0                test   %rax,%rax
  61c9d1:       0f 84 b9 30 b3 ff       je     14fa90 <palloc.cold+0x35>
  61c9d7:       5b                      pop    %rbx
  61c9d8:       41 5c                   pop    %r12
  61c9da:       5d                      pop    %rbp
  61c9db:       c3                      retq

to

  61c8c0:       48 89 fe                mov    %rdi,%rsi
  61c8c3:       48 8b 3d 16 f3 2a 00    mov    0x2af316(%rip),%rdi        # 8cbbe0 <CurrentMemoryContext>
  61c8ca:       48 81 fe ff ff ff 3f    cmp    $0x3fffffff,%rsi
  61c8d1:       0f 87 c3 31 b3 ff       ja     14fa9a <palloc.cold>
  61c8d7:       c6 47 04 00             movb   $0x0,0x4(%rdi)
  61c8db:       48 8b 47 10             mov    0x10(%rdi),%rax
  61c8df:       ff 20                   jmpq   *(%rax)

It's not hard to see why that would be faster, I think.


Of course, we cannot just omit that check. But I think this is an argument for
why it is not a great idea to have such a check in palloc() - it prevents the
use of the above optimization, and it adds a branch to a performance critical
function, though there already existing branches in aset.c etc that
specifically know about this case.

The code in palloc() does this check after context->methods->alloc() since
3d6d1b585524: Move out-of-memory error checks from aset.c to mcxt.c

Of course, that commit changed things for a reason: It allows
palloc_extended() to exist.

However, it seems that the above optimization, as well as the desire to avoid
redundant branches (checking for allocation failures in AllocSetAlloc() and
then again in palloc() etc) in critical paths, suggests pushing the handling
of MCXT_ALLOC_NO_OOM (and perhaps others) a layer down, into the memory
context implementations. Which of course means that we would need to pass down
MCXT_ALLOC_NO_OOM into at least MemoryContextMethods->alloc,realloc}. But that
seems like a good idea to me anyway. That way we could pass down further
information as well, e.g. about required alignment.

Of course it'd make sense to avoid duplicating the same error message across
all contexts, but that could be addressed using a mcxt.c helper function to
deal with the allocation failure case.

E.g the existing cases like

    block = (AllocBlock) malloc(blksize);
    if (block == NULL)
        return NULL;

could become something like
    block = (AllocBlock) malloc(blksize);
    if (unlikely(block == NULL))
        return MemoryContextAllocationFailure(context, size, flags);


The trick of avoiding stack frame setup does not just apply to wrapper
functions like palloc(). It even can apply to AllocSetAlloc() itself! If one
separates out the "slow paths" from the "fast paths" of AllocSetAlloc(), the
fast path can avoid needing the stack frame, for the price of the slow paths
being a tiny bit slower. Often the generated code turns out to be better,
because the register allocation pressure is lower in the fast path.

For that to work, the paths of AllocSetAlloc() that call malloc() need to be
separated out. As we obviously need to process malloc()'s result, the call to
malloc cannot be a tail call. So we need to split out two paths:
1) handling of large allocations
2) running out of space in the current block / having no block

To actually benefit from the optimization, those paths need to actually return
the allocated memory. And they need to be marked pg_noinline, otherwise the
compiler won't get the message...

I think this actually makes the aset.c code a good bit more readable, and
highlights where in AllocSetAlloc() adding instructions hurts, and where its
fine.

I have *not* carefully benchmarked this, but a quick implementation of this
does seem to increase readonly pgbench tps at a small scale by 2-3% (both
-Mprepared/simple). Despite not being an all that pgbench bound workload.


Rough prototype patch for the above attached.

Comments?


A slightly tangential improvement would be to move the memset() in palloc0()
et al do into a static inline. There's two benefits of that:

1) compilers can generate much better code for memset() if the length is known
   - instead of a function call with length dispatch replace that with a
   handful of instructions doing the zeroing for the precise length.

2) compilers can often optimize away [part of ]the overhead of needing to do
   the memset, as many callers will go on to overwrite a good portion of the
   zeroed data.


Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: Support for NSS as a libpq TLS backend
Next
From: Andres Freund
Date:
Subject: Re: slab allocator performance issues