Re: elog/ereport VS misleading backtrace_function function address - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: elog/ereport VS misleading backtrace_function function address
Date
Msg-id CAKZiRmyn6Zrr3fAdOeEnSwNA7cqT8hRSFFwQZ9vDx1boZmukkQ@mail.gmail.com
Whole thread Raw
In response to Re: elog/ereport VS misleading backtrace_function function address  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: elog/ereport VS misleading backtrace_function function address
List pgsql-hackers
Hi Tom and -hackers!

On Thu, Mar 28, 2024 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jakub Wartak <jakub.wartak@enterprisedb.com> writes:
> > While chasing some other bug I've learned that backtrace_functions
> > might be misleading with top elog/ereport() address.
>
> That was understood from the beginning: this type of backtrace is
> inherently pretty imprecise, and I doubt there is much that can
> be done to make it better.  IIRC the fundamental problem is it only
> looks at global symbols, so static functions inherently defeat it.
> It was argued that this is better than nothing, which is true, but
> you have to take the info with a mountain of salt.

OK, point taken, thanks for describing the limitations, still I find
backtrace_functions often the best thing we have primarily due its
simplicity and ease of use (for customers). I found out simplest
portable way to generate NOP (or any other instruction that makes the
problem go away):

with the reproducer, psql returns:

psql:ri-collation-bug-example.sql:48: error: ERROR:  XX000: cache
lookup failed for collation 0
LOCATION:  get_collation_isdeterministic, lsyscache.c:1062

logfile without patch:

2024-05-07 09:05:43.043 CEST [44720] ERROR:  cache lookup failed for collation 0
2024-05-07 09:05:43.043 CEST [44720] BACKTRACE:
        postgres: postgres postgres [local] DELETE(+0x155883) [0x55ce5a86a883]
        postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55ce5ac6dfd0]
        postgres: postgres postgres [local] DELETE(+0x2d488b) [0x55ce5a9e988b]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres  -a -f 0x155883
0x0000000000155883
get_constraint_type.cold <<<<<< so it's wrong as the real function
should be get_collation_isdeterministic

logfile with attached patch:

2024-05-07 09:11:06.596 CEST [51185] ERROR:  cache lookup failed for collation 0
2024-05-07 09:11:06.596 CEST [51185] BACKTRACE:
        postgres: postgres postgres [local] DELETE(+0x168bf0) [0x560e1cdfabf0]
        postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x560e1d200c00]
        postgres: postgres postgres [local] DELETE(+0x2e90fb) [0x560e1cf7b0fb]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres  -a -f 0x168bf0
0x0000000000168bf0
get_collation_isdeterministic.cold <<< It's ok with the patch

NOTE: in case one will be testing this: one cannot ./configure with
--enable-debug as it prevents the compiler optimizations that actually
end up with the ".cold" branch optimizations that cause backtrace() to
return the wrong address.

> I recall speculating about whether we could somehow invoke gdb
> to get a more comprehensive and accurate backtrace, but I don't
> really have a concrete idea how that could be made to work.

Oh no, I'm more about micro-fix rather than doing some bigger
revolution. The patch simply adds this one instruction in macro, so
that now backtrace_functions behaves better:

   0x0000000000773d28 <+105>:   call   0x79243f <errfinish>
   0x0000000000773d2d <+110>:   movzbl -0x12(%rbp),%eax  << this ends
up being added by the patch
   0x0000000000773d31 <+114>:   call   0xdc1a0 <abort@plt>

I'll put that as for PG18 in CF, but maybe it could be backpatched too
- no hard opinion on that (I don't see how that ERROR/FATAL path could
cause any performance regressions)

-J.

Attachment

pgsql-hackers by date:

Previous
From: "Anton A. Melnikov"
Date:
Subject: Re: psql: fix variable existence tab completion
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: cataloguing NOT NULL constraints