Hi Tom and -hackers!
On Thu, Mar 28, 2024 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jakub Wartak <jakub.wartak@enterprisedb.com> writes:
> > While chasing some other bug I've learned that backtrace_functions
> > might be misleading with top elog/ereport() address.
>
> That was understood from the beginning: this type of backtrace is
> inherently pretty imprecise, and I doubt there is much that can
> be done to make it better. IIRC the fundamental problem is it only
> looks at global symbols, so static functions inherently defeat it.
> It was argued that this is better than nothing, which is true, but
> you have to take the info with a mountain of salt.
OK, point taken, thanks for describing the limitations, still I find
backtrace_functions often the best thing we have primarily due its
simplicity and ease of use (for customers). I found out simplest
portable way to generate NOP (or any other instruction that makes the
problem go away):
with the reproducer, psql returns:
psql:ri-collation-bug-example.sql:48: error: ERROR: XX000: cache
lookup failed for collation 0
LOCATION: get_collation_isdeterministic, lsyscache.c:1062
logfile without patch:
2024-05-07 09:05:43.043 CEST [44720] ERROR: cache lookup failed for collation 0
2024-05-07 09:05:43.043 CEST [44720] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x155883) [0x55ce5a86a883]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55ce5ac6dfd0]
postgres: postgres postgres [local] DELETE(+0x2d488b) [0x55ce5a9e988b]
[..]
$ addr2line -e /usr/pgsql18/bin/postgres -a -f 0x155883
0x0000000000155883
get_constraint_type.cold <<<<<< so it's wrong as the real function
should be get_collation_isdeterministic
logfile with attached patch:
2024-05-07 09:11:06.596 CEST [51185] ERROR: cache lookup failed for collation 0
2024-05-07 09:11:06.596 CEST [51185] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x168bf0) [0x560e1cdfabf0]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x560e1d200c00]
postgres: postgres postgres [local] DELETE(+0x2e90fb) [0x560e1cf7b0fb]
[..]
$ addr2line -e /usr/pgsql18/bin/postgres -a -f 0x168bf0
0x0000000000168bf0
get_collation_isdeterministic.cold <<< It's ok with the patch
NOTE: in case one will be testing this: one cannot ./configure with
--enable-debug as it prevents the compiler optimizations that actually
end up with the ".cold" branch optimizations that cause backtrace() to
return the wrong address.
> I recall speculating about whether we could somehow invoke gdb
> to get a more comprehensive and accurate backtrace, but I don't
> really have a concrete idea how that could be made to work.
Oh no, I'm more about micro-fix rather than doing some bigger
revolution. The patch simply adds this one instruction in macro, so
that now backtrace_functions behaves better:
0x0000000000773d28 <+105>: call 0x79243f <errfinish>
0x0000000000773d2d <+110>: movzbl -0x12(%rbp),%eax << this ends
up being added by the patch
0x0000000000773d31 <+114>: call 0xdc1a0 <abort@plt>
I'll put that as for PG18 in CF, but maybe it could be backpatched too
- no hard opinion on that (I don't see how that ERROR/FATAL path could
cause any performance regressions)
-J.