Reduce function call costs on ELF platforms - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Reduce function call costs on ELF platforms |
Date | |
Msg-id | 20211122215048.2ryxchocmtbmnwmp@alap3.anarazel.de |
Responses | Re: Reduce function call costs on ELF platforms |
List | pgsql-hackers |

Hi,

There are two related, but somewhat different aspects to $subject.

TL;DR: We should use -fvisibility=hidden + explicit symbol visibility,
-Wl,-Bsymbolic, -fno-plt


1) Cross-translation-unit calls in extension library

A while ago I was looking at a profile of a workload that spent a good chunk
of time in an extension. The instruction-level profile showed that some of
that time was spent doing more-complicated-than-necessary function calls to
other functions within the extension.

Basically, the way we currently build our extensions, the compiler & linker
assume every symbol inside the extension libraries needs to be interceptable
by the main binary. Which means that all function calls to symbols visible
outside the current translation unit need to be made indirectly via the PLT.

An example of this (picked from plpgsql, for simplicity):

    0000000000024a40 <plpgsql_inline_handler>:
    {
    ...
        func = plpgsql_compile_inline(codeblock->source_text);
       24a80:   48 8b 85 a8 fe ff ff    mov    -0x158(%rbp),%rax
       24a87:   48 8b 78 08             mov    0x8(%rax),%rdi
       24a8b:   e8 20 41 fe ff          call   8bb0 <plpgsql_compile_inline@plt>
    ...
    0000000000008bb0 <plpgsql_compile_inline@plt>:
        8bb0:   ff 25 da ac 02 00       jmp    *0x2acda(%rip)        # 33890 <plpgsql_compile_inline@@Base+0x24de0>
        8bb6:   68 12 01 00 00          push   $0x112
        8bbb:   e9 c0 ee ff ff          jmp    7a80 <_init+0x18>

I.e. plpgsql_inline_handler doesn't call plpgsql_compile_inline() directly,
it calls plpgsql_compile_inline@plt(), which then loads the target address
for plpgsql_compile_inline() from the global offset table. Depending on the
linker settings / flags passed to dlopen() that'll point to yet another
wrapper function (doing a dynamic symbol lookup on the first call, putting
the real address in the GOT).

This can be addressed to some degree by using explicit symbol visibility
markers, as I propose in [1]. With that patch applied the compiler / linker
know that plpgsql_compile_inline() is not an external symbol, and that calls
to it therefore don't need to go through the PLT/GOT.
That changes the above to:

        func = plpgsql_compile_inline(codeblock->source_text);
       23000:   48 8b 85 a8 fe ff ff    mov    -0x158(%rbp),%rax
       23007:   48 8b 78 08             mov    0x8(%rax),%rdi
       2300b:   e8 00 a1 fe ff          call   d110 <plpgsql_compile_inline>

which unsurprisingly is cheaper.


2) Calls to exported functions in extension library

However, this does *not* address the issue fully. When an extension calls a
function that has to be exported, the call will continue to go through the
PLT. E.g. hstorePairs() has to be exported, because it's called from
transform modules. That results in calls to hstorePairs() from within
hstore.so going through the PLT, e.g.:

    000000000000e380 <hstore_subscript_assign>:
    {
    ...
        e427:   e8 e4 59 ff ff          call   3e10 <hstorePairs@plt>

In theory we could mark such symbols as "protected" while compiling
hstore.so and as "default" otherwise, but that's pretty complicated. And
there are some toolchain issues with protected visibility.

The easier approach for this class of issues is to use the linker option
-Bsymbolic. That turns the above into a plain function call:

    000000000000e250 <hstore_subscript_assign>:
    {
    ...
        e2f7:   e8 f4 a2 ff ff          call   85f0 <hstorePairs>

As it turns out we already use -Bsymbolic on some platforms (solaris,
hpux). But not elsewhere.


3) Function calls from extension library to main binary
4) C library function calls

However, even with the above done, calls into shared libraries still go
through the PLT. This is particularly annoying for functions like palloc()
that are quite performance sensitive and where there's no plausible use for
intercepting the function call with a different shared library.

E.g. the optimized disassembly of add_dummy_return() looks like:

    000000000000bc30 <add_dummy_return>:
    {
    ...
        new = palloc0(sizeof(PLpgSQL_stmt_block));
        bc4d:   bf 38 00 00 00          mov    $0x38,%edi
        bc52:   e8 d9 a7 ff ff          call   6430 <palloc0@plt>
    ...
    0000000000006430 <palloc0@plt>:
        6430:   ff 25 d2 bb 02 00       jmp    *0x2bbd2(%rip)        # 32008 <palloc0>
        6436:   68 01 00 00 00          push   $0x1
        643b:   e9 d0 ff ff ff          jmp    6410 <_init+0x20>

Obviously we cannot easily avoid indirection entirely in this case. The
offset needed to call palloc0 is not known when plpgsql.so is built.

But we don't actually need a two-level indirection. By compiling with
-fno-plt, the above becomes:

    000000000000b130 <add_dummy_return>:
    {
    ...
        new = palloc0(sizeof(PLpgSQL_stmt_block));
        b14d:   bf 38 00 00 00          mov    $0x38,%edi
        b152:   ff 15 80 66 02 00       call   *0x26680(%rip)        # 317d8 <palloc0>

I.e. a single level of indirection.

This has more benefits than just removing one layer of indirection. Here's
what gcc's man page says:

    -fno-plt
        Do not use the PLT for external function calls in
        position-independent code. Instead, load the callee address at
        call sites from the GOT and branch to it. This leads to more
        efficient code by eliminating PLT stubs and exposing GOT loads to
        optimizations.

In some cases this allows functions to use the sibling-call optimization
where that previously was not possible (i.e. for x86 use "jmp" instead of
"call" to call another function when that function call is the last thing
done in a function, thereby reusing the call frame and reducing the cost of
returns).

This doesn't just matter for extension libraries. It's also relevant for
the main binary (i.e. the upsides are bigger / more widely applicable):
every function call to libc goes through PLT+GOT (well, with a dynamically
linked libc). This includes things that are often called in performance
critical bits, like strlen. E.g.
without -fno-plt, raw_parser() calls strlen via the PLT:

        cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
      2775a6:   49 63 55 00             movslq 0x0(%r13),%rdx
      2775aa:   4c 8b 3b                mov    (%rbx),%r15
      2775ad:   48 89 4d c0             mov    %rcx,-0x40(%rbp)
      2775b1:   49 8d 3c 17             lea    (%r15,%rdx,1),%rdi
      2775b5:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
      2775b9:   e8 82 03 e5 ff          call   c7940 <strlen@plt>

but not with -fno-plt:

        cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
      2838e6:   49 63 55 00             movslq 0x0(%r13),%rdx
      2838ea:   4c 8b 3b                mov    (%rbx),%r15
      2838ed:   48 89 4d c0             mov    %rcx,-0x40(%rbp)
      2838f1:   49 8d 3c 17             lea    (%r15,%rdx,1),%rdi
      2838f5:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
      2838f9:   ff 15 09 45 66 00       call   *0x664509(%rip)        # 8e7e08 <strlen@GLIBC_2.2.5>

I haven't run detailed benchmarks in isolation, but have seen some good
results. It obviously is heavily workload dependent.

Greetings,

Andres Freund

[1] https://postgr.es/m/20211101020311.av6hphdl6xbjbuif%40alap3.anarazel.de
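
To make points 1) and 2) a bit more concrete, here is a minimal sketch of
what explicit visibility marking could look like in an extension's source.
The macro names EXT_EXPORT / EXT_LOCAL and both functions are made up for
illustration and are not taken from the patch in [1]; the build commands in
the comment are just one plausible GCC/Clang invocation on an ELF platform.

```c
/*
 * Sketch only: EXT_EXPORT, EXT_LOCAL and the ext_* functions are
 * hypothetical names, not PostgreSQL APIs.
 *
 * One possible way to build this as a shared library (assumed commands):
 *   cc -O2 -fPIC -fvisibility=hidden -fno-plt -c ext_demo.c
 *   cc -shared -Wl,-Bsymbolic -o ext_demo.so ext_demo.o
 */
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define EXT_EXPORT __attribute__((visibility("default")))
#define EXT_LOCAL  __attribute__((visibility("hidden")))
#else
#define EXT_EXPORT
#define EXT_LOCAL
#endif

/*
 * Shared between the library's translation units, but not exported.
 * Hidden visibility tells the toolchain the symbol cannot be interposed
 * from outside the library, so callers inside it can use a direct call.
 */
EXT_LOCAL size_t
ext_count_commas(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (*s == ',')
            n++;
    }
    return n;
}

/*
 * Has to stay exported because other modules call it. Calls to it from
 * inside the same library would still go via the PLT unless the library
 * is linked with -Bsymbolic (or the symbol is made protected).
 */
EXT_EXPORT size_t
ext_item_count(const char *s)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return 0;
    return ext_count_commas(s) + 1;
}
```

Comparing "objdump -d" output of the resulting library with and without the
flags should show whether the call to ext_count_commas() still goes through
an @plt stub, assuming the compiler doesn't simply inline it.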