Reduce function call costs on ELF platforms - Mailing list pgsql-hackers

From Andres Freund
Subject Reduce function call costs on ELF platforms
Date
Msg-id 20211122215048.2ryxchocmtbmnwmp@alap3.anarazel.de
Responses Re: Reduce function call costs on ELF platforms
List pgsql-hackers
Hi,

There are two related, but somewhat different, aspects to $subject.

TL;DR: We should use -fvisibility=hidden + explicit symbol visibility,
-Wl,-Bsymbolic, -fno-plt


1) Cross-translation-unit calls in extension library

A while ago I was looking at a profile of a workload that spent a good chunk
of time in an extension. Looking at the instruction level profile it showed
that some of that time was spent doing more-complicated-than-necessary
function calls to other functions within the extension.

Basically, the way we currently build our extensions, the compiler & linker
assume every symbol inside the extension libraries needs to be interceptable
by the main binary. Which means that all function calls to symbols visible
outside the current translation unit need to be made indirectly via the PLT.

An example of this (picked from plpgsql, for simplicity):

0000000000024a40 <plpgsql_inline_handler>:
{
...
        func = plpgsql_compile_inline(codeblock->source_text);
   24a80:       48 8b 85 a8 fe ff ff    mov    -0x158(%rbp),%rax
   24a87:       48 8b 78 08             mov    0x8(%rax),%rdi
   24a8b:       e8 20 41 fe ff          call   8bb0 <plpgsql_compile_inline@plt>
...

0000000000008bb0 <plpgsql_compile_inline@plt>:
    8bb0:       ff 25 da ac 02 00       jmp    *0x2acda(%rip)        # 33890 <plpgsql_compile_inline@@Base+0x24de0>
    8bb6:       68 12 01 00 00          push   $0x112
    8bbb:       e9 c0 ee ff ff          jmp    7a80 <_init+0x18>

I.e. plpgsql_inline_handler doesn't call plpgsql_compile_inline() directly, it
calls plpgsql_compile_inline@plt(), which then loads the target address for
plpgsql_compile_inline() from the global offset table. Depending on the linker
settings / flags passed to dlopen() that'll point to yet another wrapper
function (doing a dynamic symbol lookup on the first call, putting the
real address in the GOT).

This can be addressed to some degree by using explicit symbol visibility
markers, as I propose in [1].

With that patch applied, the compiler / linker know that
plpgsql_compile_inline() is not an external symbol, and that calls to it
therefore don't need to go through the PLT/GOT. That changes the above to:

        func = plpgsql_compile_inline(codeblock->source_text);
   23000:       48 8b 85 a8 fe ff ff    mov    -0x158(%rbp),%rax
   23007:       48 8b 78 08             mov    0x8(%rax),%rdi
   2300b:       e8 00 a1 fe ff          call   d110 <plpgsql_compile_inline>

which unsurprisingly is cheaper.
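
To make the idea a bit more concrete, here is a minimal sketch of what
explicit visibility markers can look like in source. The macro names
EXT_EXPORT / EXT_LOCAL and both functions are made up for this illustration;
they are not the names used by the patch in [1]. It assumes a gcc or clang
style compiler and a build with -fPIC -shared -fvisibility=hidden.

```c
/*
 * Hypothetical extension source, built roughly like
 *   gcc -O2 -fPIC -shared -fvisibility=hidden ext.c -o ext.so
 * EXT_EXPORT / EXT_LOCAL are illustrative names, not PostgreSQL macros.
 */
#if defined(__GNUC__) || defined(__clang__)
#define EXT_EXPORT __attribute__((visibility("default")))
#define EXT_LOCAL  __attribute__((visibility("hidden")))
#else
#define EXT_EXPORT
#define EXT_LOCAL
#endif

/*
 * Used by other translation units of the same library, but never looked up
 * from outside of it. With -fvisibility=hidden the marker is redundant; it
 * is spelled out here only to make the intent visible.
 */
EXT_LOCAL int
ext_compile_helper(int len)
{
    return len * 2;
}

/* Entry point that is looked up from outside, so it has to stay exported. */
EXT_EXPORT int
ext_handler(int len)
{
    /* The callee is hidden, so this call does not need a PLT stub. */
    return ext_compile_helper(len) + 1;
}
```

Comparing objdump -d output of the library built with and without
-fvisibility=hidden should show whether the call inside ext_handler() goes
through an @plt stub.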


2) Calls to exported functions in extension library

However, this does *not* address the issue fully. When an extension calls a
function that has to be exported, the call will continue to go through the
PLT.

E.g. hstorePairs() has to be exported, because it's called from transform
modules. That results in calls to hstorePairs() from within hstore.so to go
through the PLT, e.g.:

000000000000e380 <hstore_subscript_assign>:
{
...
    e427:       e8 e4 59 ff ff          call   3e10 <hstorePairs@plt>


In theory we could mark such symbols as "protected" while compiling hstore.so
and as "default" otherwise, but that's pretty complicated. And there are some
toolchain issues with protected visibility.

The easier approach for this class of issues is to use the linker option
-Bsymbolic. That turns the above into a plain function call:

000000000000e250 <hstore_subscript_assign>:
{
...
    e2f7:       e8 f4 a2 ff ff          call   85f0 <hstorePairs>


As it turns out we already use -Bsymbolic on some platforms (solaris,
hpux). But not elsewhere.
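
For anyone wanting to reproduce this outside of the PostgreSQL tree, a
minimal sketch of the same situation follows. The LIB_EXPORT macro and the
pairs_count() / pairs_assign() functions are hypothetical stand-ins for the
hstorePairs() case, and the build commands in the comment are what I would
expect to work with gcc and GNU ld, not something taken from our makefiles.

```c
/*
 * Hypothetical library with an exported function that is also called from
 * inside the library itself.
 *
 *   gcc -O2 -fPIC -shared lib.c -o lib.so
 *   gcc -O2 -fPIC -shared -Wl,-Bsymbolic lib.c -o lib.so
 *
 * The second variant asks the linker to bind references to global symbols
 * to the definitions inside the library being linked.
 */
#if defined(__GNUC__) || defined(__clang__)
#define LIB_EXPORT __attribute__((visibility("default")))
#else
#define LIB_EXPORT
#endif

/* Has to stay exported, because other modules call it as well. */
LIB_EXPORT int
pairs_count(int nkeys)
{
    return nkeys < 0 ? 0 : nkeys;
}

/* Exported entry point that calls the exported function above. */
LIB_EXPORT int
pairs_assign(int nkeys)
{
    return pairs_count(nkeys) + 1;
}
```

Disassembling pairs_assign() from both builds should show whether the call
targets pairs_count@plt or pairs_count directly.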


3) Function calls from extension library to main binary
4) C library function calls

However, even with the above done, calls into shared libraries still
go through the PLT. This is particularly annoying for functions like palloc()
that are quite performance sensitive and where there's no potential use of
intercepting the function call with a different shared library.

E.g. the optimized disassembly of add_dummy_return() looks like

000000000000bc30 <add_dummy_return>:
{
...
                new = palloc0(sizeof(PLpgSQL_stmt_block));
    bc4d:       bf 38 00 00 00          mov    $0x38,%edi
    bc52:       e8 d9 a7 ff ff          call   6430 <palloc0@plt>
...
0000000000006430 <palloc0@plt>:
    6430:       ff 25 d2 bb 02 00       jmp    *0x2bbd2(%rip)        # 32008 <palloc0>
    6436:       68 01 00 00 00          push   $0x1
    643b:       e9 d0 ff ff ff          jmp    6410 <_init+0x20>


Obviously we cannot easily avoid indirection entirely in this case. The offset
to call palloc0 is not known when plpgsql.so is built. But we don't actually
need a two-level indirection.


By compiling with -fno-plt, the above becomes:

000000000000b130 <add_dummy_return>:
{
...
                new = palloc0(sizeof(PLpgSQL_stmt_block));
    b14d:       bf 38 00 00 00          mov    $0x38,%edi
    b152:       ff 15 80 66 02 00       call   *0x26680(%rip)        # 317d8 <palloc0>

I.e. a single level of indirection. This has more benefits than just removing
one layer of indirection. Here's what gcc's man page says:

       -fno-plt
           Do not use the PLT for external function calls in
           position-independent code.  Instead, load the callee address
           at call sites from the GOT and branch to it.  This leads to
           more efficient code by eliminating PLT stubs and exposing GOT
           loads to optimizations.

In some cases this allows functions to use the sibling-call optimization where
that previously was not possible (i.e. for x86 use "jmp" instead of "call" to
call another function when that function call is the last thing done in a
function, thereby reusing the call frame and reducing the cost of returns).
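
A small sketch of the kind of function where a sibling call is possible, in
case somebody wants to look at the generated code. token_length() is a
made-up wrapper, not something in our tree, and the compiler invocation in
the comment is just one way to look at the assembly; what actually gets
emitted depends on compiler version and target.

```c
#include <string.h>

/*
 * Hypothetical wrapper whose last action is a call to an external function.
 * To inspect the code generated for it as position-independent code:
 *   gcc -O2 -fPIC -fno-plt -S tail.c
 * With -fno-plt the compiler may turn the call into an indirect jmp through
 * the GOT rather than a call followed by a ret.
 */
size_t
token_length(const char *buf, int offset)
{
    /* Nothing is left to do after strlen() returns: a tail position call. */
    return strlen(buf + offset);
}
```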


-fno-plt doesn't just matter for extension libraries. It's also relevant for
the main binary (i.e. the upsides are bigger / more widely applicable) - every
function call to libc goes through PLT+GOT (well, with a dynamically linked
libc). This includes things that are often called in performance critical
bits, like strlen. E.g. without -fno-plt raw_parser() calls strlen via the
PLT:

                        cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
  2775a6:       49 63 55 00             movslq 0x0(%r13),%rdx
  2775aa:       4c 8b 3b                mov    (%rbx),%r15
  2775ad:       48 89 4d c0             mov    %rcx,-0x40(%rbp)
  2775b1:       49 8d 3c 17             lea    (%r15,%rdx,1),%rdi
  2775b5:       48 89 55 c8             mov    %rdx,-0x38(%rbp)
  2775b9:       e8 82 03 e5 ff          call   c7940 <strlen@plt>

but not with -fno-plt:
                        cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
  2838e6:       49 63 55 00             movslq 0x0(%r13),%rdx
  2838ea:       4c 8b 3b                mov    (%rbx),%r15
  2838ed:       48 89 4d c0             mov    %rcx,-0x40(%rbp)
  2838f1:       49 8d 3c 17             lea    (%r15,%rdx,1),%rdi
  2838f5:       48 89 55 c8             mov    %rdx,-0x38(%rbp)
  2838f9:       ff 15 09 45 66 00       call   *0x664509(%rip)        # 8e7e08 <strlen@GLIBC_2.2.5>


I haven't run detailed benchmarks in isolation, but have seen some good
results. It obviously is heavily workload dependent.

Greetings,

Andres Freund

[1] https://postgr.es/m/20211101020311.av6hphdl6xbjbuif%40alap3.anarazel.de
