Re: elog/ereport VS misleading backtrace_function function address - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: elog/ereport VS misleading backtrace_function function address
Date
Msg-id CAKZiRmzu+49tA2m-Qkx-fEbzSvCCQZawcuauPtXUK92i6tTCZA@mail.gmail.com
Whole thread Raw
In response to Re: elog/ereport VS misleading backtrace_function function address  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: elog/ereport VS misleading backtrace_function function address
Re: elog/ereport VS misleading backtrace_function function address
List pgsql-hackers
Hi Peter!

On Sun, May 12, 2024 at 10:33 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 07.05.24 09:43, Jakub Wartak wrote:
> > NOTE: in case one will be testing this: one cannot ./configure with
> > --enable-debug as it prevents the compiler optimizations that actually
> > end up with the ".cold" branch optimizations that cause backtrace() to
> > return the wrong address.
>
> Is that configuration useful?  If you're interested in backtraces,
> wouldn't you also want debug symbols?

The use case here is that backtrace_functions is unreliable when we
ask customers on production builds to use it (so it's useful not just
for local tests)

> Don't production builds use debug
> symbols nowadays as well?

Reality is apparently mixed,at least from what I have checked :
- all RHEL 7.x/8.x (PGDG and our forks) do NOT come with
--enable-debug according to pg_config.
- on Debian 11/12 PGDG does come with --enable-debug

> >> I recall speculating about whether we could somehow invoke gdb
> >> to get a more comprehensive and accurate backtrace, but I don't
> >> really have a concrete idea how that could be made to work.
> > Oh no, I'm more about micro-fix rather than doing some bigger
> > revolution. The patch simply adds this one instruction in macro, so
> > that now backtrace_functions behaves better:
> >
> >     0x0000000000773d28 <+105>:   call   0x79243f <errfinish>
> >     0x0000000000773d2d <+110>:   movzbl -0x12(%rbp),%eax  << this ends
> > up being added by the patch
> >     0x0000000000773d31 <+114>:   call   0xdc1a0 <abort@plt>
> >
> > I'll put that as for PG18 in CF, but maybe it could be backpatched too
> > - no hard opinion on that (I don't see how that ERROR/FATAL path could
> > cause any performance regressions)
>
> I'm missing a principled explanation of what this does.  I just see that
> it sprinkles some no-op code to make this particular thing work a bit
> differently, but this might just behave differently with the next
> compiler next year.  I'd like to see a bit more of an abstract
> explanation of what is happening here.

OK I'll try to explain using assembly, but I'm not an expert on this.
Let's go to the 1st post, assume we run with backtrace_function set:

get_collation_isdeterministic()  0x5c7680 to 0x5c76c1:
       might jump(!) to 0x14c686 <get_collation_isdeterministic.cold>
       note the two completely different address ranges (hot and separate cold)
       it's because of 913ec71d682 that added the cold branches
optimization via pg_attribute_cold to errstart_cold

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
   0x00000000005c7680 <+0>:     push   %rbp
[..]
   0x00000000005c7694 <+20>:    call   0x5d63b0 <SearchSysCache1>
   0x00000000005c7699 <+25>:    test   %rax,%rax
   0x00000000005c769c <+28>:    je     0x14c686
<get_collation_isdeterministic.cold>
[..]
   0x00000000005c76b3 <+51>:    call   0x5d6430 <ReleaseSysCache>
   0x00000000005c76b8 <+56>:    mov    %r12d,%eax
   0x00000000005c76bb <+59>:    mov    -0x8(%rbp),%r12
   0x00000000005c76bf <+63>:    leave
   0x00000000005c76c0 <+64>:    ret
Address range 0x14c686 to 0x14c6bb:
                                    // note here that the last
instruction is 0x14c6b6 not 0x14c6bb(!)
                                    // note this cold path also has no
frame pointer setup
   0x000000000014c686 <-4698106>:       xor    %esi,%esi
   0x000000000014c688 <-4698104>:       mov    $0x15,%edi
   0x000000000014c68d <-4698099>:       call   0x14ec86 <errstart_cold>
   0x000000000014c692 <-4698094>:       mov    %r12d,%esi
   0x000000000014c695 <-4698091>:       lea    0x5028dc(%rip),%rdi
   # 0x64ef78
   0x000000000014c69c <-4698084>:       xor    %eax,%eax
   0x000000000014c69e <-4698082>:       call   0x5df320 <errmsg_internal>
   0x000000000014c6a3 <-4698077>:       lea    0x6311a6(%rip),%rdx
   # 0x77d850 <__func__.34>
   0x000000000014c6aa <-4698070>:       mov    $0x440,%esi
   0x000000000014c6af <-4698065>:       lea    0x630c8a(%rip),%rdi
   # 0x77d340
   0x000000000014c6b6 <-4698058>:       call   0x5df100 <errfinish>
End of assembler dump.

so it's
    get_collation_isdeterministic() ->
        get_collation_isdeterministic.cold() [but not real function ->
no proper fp/stack?] ->
            .. ->
                errfinish() ->
                    set_backtrace() // just builds and appends string
using backtrace()/backtrace_functions()/backtrace_symbol_list().
                    prints/logs, finishes

It seems that the thing is that the limited GNU libc
backtrace_symbol_list() won't resolve the 0x14c6b6..0x14c6bb to the
"get_collation_isdeterministic[.cold]" symbol name and it will just
simply put 0x14c6bb in that the text asm. It's wrong and it is
confusing:

2024-03-27 14:39:13.065 CET [12899] postgres(at)test1 ERROR:  cache
lookup failed for collation 0
2024-03-27 14:39:13.065 CET [12899] postgres(at)test1 BACKTRACE:
        postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb)
[0x5565c5a9c6bb]
        postgres: 16/main: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973]
[..]

if you now take addr2line for 0x14c6bb it will be resolved to the next
assembly AFTER where it had happened: we did blow up in
<0x14c686..0x14c6bb), but it put 0x14c6bb as the IP:

$ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb
0x000000000014c6bb
get_language_name.cold

^ it's wrong,! but GDB also gets the same (GIGO):

(gdb) disassemble 0x14c6bb
Dump of assembler code for function get_language_name:
Address range 0x5c7780 to 0x5c77ee:
[..]
   0x00000000005c77da <+90>:    je     0x14c6bb <get_language_name.cold>
[..]
   0x00000000005c77ed <+109>:   ret
Address range 0x14c6bb to 0x14c6f0: // <<< HERE
   0x000000000014c6bb <-4698309>:       xor    %esi,%esi
[..]

So something was saved in the wrong way (fp?), however adding that 1
trivial NOP instruction between errfinish() and pg_unreachable() where
afterwards somehow it returns, does make the returned address (using
addr2line) resolve to the proper function. My limited understanding is
that we simply pushed the compiler into generating that cold path
(that has no %ebp/%rbp frame pointer assembly) followed by no-exit in
asm code that prevents backtrace() from working. So it looks like we
are playing hard at micro-optimizations (cold branch optimizations,
again 913ec71d682), we provide backtrace_functions GUC and we provide
various packaging ways, and that somehow bite me. The backtrace(3)
says `Omission of the frame pointers (as implied by any of gcc(1)'s
nonzero optimization levels) may cause these assumptions to be
violated.`

-J.



pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Log details for stats dropped more than once
Next
From: Daniel Gustafsson
Date:
Subject: Re: Fix resource leak (src/backend/libpq/be-secure-common.c)