Avoid stack frame setup in performance critical routines using tail calls - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Avoid stack frame setup in performance critical routines using tail calls |
Date | |
Msg-id | 20210719195950.gavgs6ujzmjfaiig@alap3.anarazel.de Whole thread Raw |
Responses |
Re: Avoid stack frame setup in performance critical routines using tail calls
Re: Avoid stack frame setup in performance critical routines using tail calls |
List | pgsql-hackers |
Hi, We have a few routines that are taking up a meaningful share of nearly all workloads. They are worth micro-optimizing, even though they rarely the most expensive parts of a profile. The memory allocation infrastructure is an example of that. When looking at a profile one can often see that a measurable percentage of the time is spent doing stack frame setup in functions like palloc(), AllocSetAlloc(). E.g. here's a perf profile annotating palloc(), the first column showing the percentage of the time the relevant instruction was sampled: │ void * │ palloc(Size size) │ { 11.62 │ push %rbp 5.89 │ mov %rsp,%rbp 11.79 │ push %r12 │ mov %rdi,%r12 6.07 │ push %rbx │ /* duplicates MemoryContextAlloc to avoid increased overhead */ │ void *ret; │ MemoryContext context = CurrentMemoryContext; │ mov CurrentMemoryContext,%rbx │ │ AssertArg(MemoryContextIsValid(context)); │ AssertNotInCriticalSection(context); │ │ if (!AllocSizeIsValid(size)) 5.86 │ cmp $0x3fffffff,%rdi │ → ja 14fa5b <palloc.cold> │ elog(ERROR, "invalid memory alloc request size %zu", size); │ │ context->isReset = false; 17.71 │ movb $0x0,0x4(%rbx) │ │ ret = context->methods->alloc(context, size); 5.98 │ mov 0x10(%rbx),%rax │ mov %rdi,%rsi │ mov %rbx,%rdi 35.08 │ → callq *(%rax) The stack frame setup bit is the push ... bit. At least on x86-64 unixoid systems, that overhead can be avoided in certain circumstances! The simplest case is if the function doesn't do any function calls of its own. If simple enough (i.e. no register spilling), the compiler will just not set up a stack frame - nobody could need it. The slightly more complicated case is that of a function that only does a "tail call", i.e. the only function call is just before returning (there can be multiple such paths though). If the function is simple enough, gcc/clang will then not use the "call" instruction to call the function (which would require a proper stack frame being set up), but instead just jump to the other function. Which ends up reusing the current function's stack frame, basically. When that called function returns using 'ret', it'll reuse the location pushed onto the call stack by the caller of the "original" function, and return to its caller. Having optimized away the need to maintain one stack frame level, and one indirection when returning from the inner function (which just would do its own ret). For that to work, there obviously cannot be any instructions in the function before calling the inner function. Which brings us back to the palloc example from above. As an experiment, if i change the code for palloc() to omit the if (ret == NULL) check, the assembly (omitting source for brevity) from: 61c9a0: 55 push %rbp 61c9a1: 48 89 e5 mov %rsp,%rbp 61c9a4: 41 54 push %r12 61c9a6: 49 89 fc mov %rdi,%r12 61c9a9: 53 push %rbx 61c9aa: 48 8b 1d 2f f2 2a 00 mov 0x2af22f(%rip),%rbx # 8cbbe0 <CurrentMemoryContext> 61c9b1: 48 81 ff ff ff ff 3f cmp $0x3fffffff,%rdi 61c9b8: 0f 87 9d 30 b3 ff ja 14fa5b <palloc.cold> 61c9be: c6 43 04 00 movb $0x0,0x4(%rbx) 61c9c2: 48 8b 43 10 mov 0x10(%rbx),%rax 61c9c6: 48 89 fe mov %rdi,%rsi 61c9c9: 48 89 df mov %rbx,%rdi 61c9cc: ff 10 callq *(%rax) 61c9ce: 48 85 c0 test %rax,%rax 61c9d1: 0f 84 b9 30 b3 ff je 14fa90 <palloc.cold+0x35> 61c9d7: 5b pop %rbx 61c9d8: 41 5c pop %r12 61c9da: 5d pop %rbp 61c9db: c3 retq to 61c8c0: 48 89 fe mov %rdi,%rsi 61c8c3: 48 8b 3d 16 f3 2a 00 mov 0x2af316(%rip),%rdi # 8cbbe0 <CurrentMemoryContext> 61c8ca: 48 81 fe ff ff ff 3f cmp $0x3fffffff,%rsi 61c8d1: 0f 87 c3 31 b3 ff ja 14fa9a <palloc.cold> 61c8d7: c6 47 04 00 movb $0x0,0x4(%rdi) 61c8db: 48 8b 47 10 mov 0x10(%rdi),%rax 61c8df: ff 20 jmpq *(%rax) It's not hard to see why that would be faster, I think. Of course, we cannot just omit that check. But I think this is an argument for why it is not a great idea to have such a check in palloc() - it prevents the use of the above optimization, and it adds a branch to a performance critical function, though there already existing branches in aset.c etc that specifically know about this case. The code in palloc() does this check after context->methods->alloc() since 3d6d1b585524: Move out-of-memory error checks from aset.c to mcxt.c Of course, that commit changed things for a reason: It allows palloc_extended() to exist. However, it seems that the above optimization, as well as the desire to avoid redundant branches (checking for allocation failures in AllocSetAlloc() and then again in palloc() etc) in critical paths, suggests pushing the handling of MCXT_ALLOC_NO_OOM (and perhaps others) a layer down, into the memory context implementations. Which of course means that we would need to pass down MCXT_ALLOC_NO_OOM into at least MemoryContextMethods->alloc,realloc}. But that seems like a good idea to me anyway. That way we could pass down further information as well, e.g. about required alignment. Of course it'd make sense to avoid duplicating the same error message across all contexts, but that could be addressed using a mcxt.c helper function to deal with the allocation failure case. E.g the existing cases like block = (AllocBlock) malloc(blksize); if (block == NULL) return NULL; could become something like block = (AllocBlock) malloc(blksize); if (unlikely(block == NULL)) return MemoryContextAllocationFailure(context, size, flags); The trick of avoiding stack frame setup does not just apply to wrapper functions like palloc(). It even can apply to AllocSetAlloc() itself! If one separates out the "slow paths" from the "fast paths" of AllocSetAlloc(), the fast path can avoid needing the stack frame, for the price of the slow paths being a tiny bit slower. Often the generated code turns out to be better, because the register allocation pressure is lower in the fast path. For that to work, the paths of AllocSetAlloc() that call malloc() need to be separated out. As we obviously need to process malloc()'s result, the call to malloc cannot be a tail call. So we need to split out two paths: 1) handling of large allocations 2) running out of space in the current block / having no block To actually benefit from the optimization, those paths need to actually return the allocated memory. And they need to be marked pg_noinline, otherwise the compiler won't get the message... I think this actually makes the aset.c code a good bit more readable, and highlights where in AllocSetAlloc() adding instructions hurts, and where its fine. I have *not* carefully benchmarked this, but a quick implementation of this does seem to increase readonly pgbench tps at a small scale by 2-3% (both -Mprepared/simple). Despite not being an all that pgbench bound workload. Rough prototype patch for the above attached. Comments? A slightly tangential improvement would be to move the memset() in palloc0() et al do into a static inline. There's two benefits of that: 1) compilers can generate much better code for memset() if the length is known - instead of a function call with length dispatch replace that with a handful of instructions doing the zeroing for the precise length. 2) compilers can often optimize away [part of ]the overhead of needing to do the memset, as many callers will go on to overwrite a good portion of the zeroed data. Greetings, Andres Freund
Attachment
pgsql-hackers by date: