Thread: PATCH: backtraces for error messages

PATCH: backtraces for error messages

From
Craig Ringer
Date:
Hi folks

I recently needed a way to get backtraces from errors in a convenient, non-interactive and indiscriminate way. The attached patch is the result. It teaches Pg to use libunwind to self-backtrace in elog/ereport.

Anyone think this is useful/interesting/worth pursuing?

(Patch is currently against pg10, so this is a PoC only).

As written it emits a backtrace when log_error_verbosity=verbose or, unconditionally, on PANIC. A bt can be hidden by errhidestack() but is otherwise shown. That's ridiculously, excessively spammy, so it's not viable for core as-is. Before playing with it too much I thought I'd ask for ideas on if anyone thinks it's useful, and if so, how it'd work best.
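
For reference, the unwinding core looks roughly like this (a simplified sketch, not the patch itself; the real thing is wired into elog.c and writes into the error data rather than stderr):

    #define UNW_LOCAL_ONLY
    #include <libunwind.h>
    #include <stdio.h>

    /* Simplified sketch of libunwind-based self-backtracing. */
    static void
    print_self_backtrace(void)
    {
        unw_context_t context;
        unw_cursor_t cursor;
        int         frame = 0;

        unw_getcontext(&context);
        unw_init_local(&cursor, &context);

        /* the first unw_step() skips this function's own frame */
        while (unw_step(&cursor) > 0)
        {
            char        symbol[256];
            unw_word_t  offset;

            if (unw_get_proc_name(&cursor, symbol, sizeof(symbol), &offset) == 0)
                fprintf(stderr, "    FRAME %4d: %s +%lu\n",
                        frame, symbol, (unsigned long) offset);
            else
                fprintf(stderr, "    FRAME %4d: <unknown>\n", frame);
            frame++;
        }
    }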

My goal is to allow capture of extra diagnostic info from key locations in production without needing to attach gdb. It's imperfect since sometimes there's no convenient message, and other times you can't afford to set up logging with enough detail. So this would be most useful when combined with one of the occasionally discussed patches to allow for selective logging verbosity on a module- or file- level. But I think it's still handy without that. 

I briefly looked into Windows too. Roughly the same approach could be used to plug in dbghelp.dll support for Windows self-backtracing, it's just rather uglier; see https://jpassing.com/2008/03/12/walking-the-stack-of-the-current-thread/ .
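
Very roughly, the dbghelp flavour would look something like the sketch below (untested and purely illustrative; it needs dbghelp.lib, and SymInitialize() should really be called once at startup):

    #include <windows.h>
    #include <dbghelp.h>
    #include <stdio.h>

    /* Untested sketch of a dbghelp-based self-backtrace on Windows. */
    static void
    print_self_backtrace_win32(void)
    {
        HANDLE      proc = GetCurrentProcess();
        void       *frames[32];
        USHORT      nframes;
        USHORT      i;
        char        symbuf[sizeof(SYMBOL_INFO) + 256];
        SYMBOL_INFO *sym = (SYMBOL_INFO *) symbuf;

        SymInitialize(proc, NULL, TRUE);
        nframes = CaptureStackBackTrace(0, 32, frames, NULL);

        sym->SizeOfStruct = sizeof(SYMBOL_INFO);
        sym->MaxNameLen = 255;

        for (i = 0; i < nframes; i++)
        {
            DWORD64     displacement = 0;

            if (SymFromAddr(proc, (DWORD64) (ULONG_PTR) frames[i], &displacement, sym))
                fprintf(stderr, "    FRAME %4u: %s +%llu\n",
                        i, sym->Name, (unsigned long long) displacement);
            else
                fprintf(stderr, "    FRAME %4u: %p\n", i, frames[i]);
        }
    }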

BTW, Álvaro posted a simpler patch at https://www.postgresql.org/message-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql. It uses backtrace() from glibc, and requires each site you want to bt to be annotated explicitly. I forgot about backtrace() completely when I wrote mine, and I prefer the control libunwind gives me anyway, but the reduced dependency would be nice. Especially since backtrace() is in FreeBSD too.
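
For comparison, the whole glibc interface is basically just this (sketch):

    #include <execinfo.h>
    #include <unistd.h>

    /* Sketch: glibc's backtrace() needs no extra library, but symbol
     * names are resolved from the dynamic symbol table only (think
     * -rdynamic), with none of the caching or control libunwind offers. */
    static void
    print_self_backtrace_glibc(void)
    {
        void   *frames[32];
        int     nframes = backtrace(frames, 32);

        backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    }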

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: PATCH: backtraces for error messages

From
Alvaro Herrera
Date:
On 2018-Jun-20, Craig Ringer wrote:

> Hi folks
> 
> I recently needed a way to get backtraces from errors in a convenient,
> non-interactive and indiscriminate way. The attached patch is the result.
> It teaches Pg to use libunwind to self-backtrace in elog/ereport.
> 
> Anyone think this is useful/interesting/worth pursuing?

I think we sorely need some mechanism to optionally get backtraces in
error messages.  I think having backtraces in all messages is definitely
not a good thing, but requiring an explicit marker (such as in my patch)
means the code has to be recompiled, which is not a solution in
production systems.  I kind of lean towards your approach, but it has to be
something that's easily enabled/disabled at runtime.

I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected.  Something like PGC_USERSET
  server_min_backtraces=PANIC 
might be a possible interface.
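
In errfinish() terms I'm imagining roughly the following (just a sketch; backtrace_min_level and append_backtrace() are invented names):

    /* Sketch only: gate stack collection on a server_min_backtraces-style
     * GUC; backtrace_min_level (the GUC's variable) and append_backtrace()
     * are invented names. */
    if (edata->elevel >= backtrace_min_level)
        append_backtrace(edata);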

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: backtraces for error messages

From
Alexander Kuzmenkov
Date:
On 06/20/2018 06:10 PM, Craig Ringer wrote:
>
> I recently needed a way to get backtraces from errors in a convenient, 
> non-interactive and indiscriminate way. The attached patch is the
> result. It teaches Pg to use libunwind to self-backtrace in elog/ereport.
>
> Anyone think this is useful/interesting/worth pursuing?
>

This would be a great thing to have. I often need to add stack traces to 
the logs, and for now I just link with libunwind and add some ad-hoc 
function to grab the traces. Having libunwind support in core would make 
this simpler.


-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: PATCH: backtraces for error messages

From
Andres Freund
Date:
Hi,

On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
> > I recently needed a way to get backtraces from errors in a convenient,
> > non-interactive and indiscriminate way. The attached patch is the result.
> > It teaches Pg to use libunwind to self-backtrace in elog/ereport.
> > 
> > Anyone think this is useful/interesting/worth pursuing?

Generally interesting, yes.


> I think we sorely need some mechanism to optionally get backtraces in
> error messages.  I think having backtraces in all messages is definitely
> not a good thing, but requiring an explicit marker (such as in my patch)
> means the code has to be recompiled, which is not a solution in
> production systems.  I kind of lean towards your approach, but it has to be
> something that's easily enabled/disabled at runtime.

> I have no idea how expensive backtrace() and libunwind are, but I doubt
> we want to pay the cost for every message before we know that error
> requires the backtrace to be collected.  Something like PGC_USERSET
>   server_min_backtraces=PANIC 
> might be a possible interface.

Yes, most definitely. We can't do this everywhere. It's quite expensive
to collect / build them.  So I think we'd probably need another guc that
controls when the information is collected, perhaps defaulting to PANIC.

Greetings,

Andres Freund


Re: PATCH: backtraces for error messages

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
>> I have no idea how expensive backtrace() and libunwind are, but I doubt
>> we want to pay the cost for every message before we know that error
>> requires the backtrace to be collected.  Something like PGC_USERSET
>> server_min_backtraces=PANIC 
>> might be a possible interface.

> Yes, most definitely. We can't do this everywhere. It's quite expensive
> to collect / build them.  So I think we'd probably need another guc that
> controls when the information is collected, perhaps defaulting to PANIC.

The cost is a problem, and the dependencies on various additional
libraries are too.  I wonder whether anything could be gained by
putting this stuff in an extension?  Then we could have different
extensions for different backtrace libraries, and loading the extension
or not would be one avenue to control whether you were paying the cost.
(Though some control GUCs might be needed anyway.)

            regards, tom lane


Re: PATCH: backtraces for error messages

From
Andres Freund
Date:
On 2018-06-20 12:04:51 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
> >> I have no idea how expensive backtrace() and libunwind are, but I doubt
> >> we want to pay the cost for every message before we know that error
> >> requires the backtrace to be collected.  Something like PGC_USERSET
> >> server_min_backtraces=PANIC 
> >> might be a possible interface.
> 
> > Yes, most definitely. We can't do this everywhere. It's quite expensive
> > to collect / build them.  So I think we'd probably need another guc that
> > controls when the information is collected, perhaps defaulting to PANIC.
> 
> The cost is a problem, and the dependencies on various additional
> libraries are too.  I wonder whether anything could be gained by
> putting this stuff in an extension?  Then we could have different
> extensions for different backtrace libraries, and loading the extension
> or not would be one avenue to control whether you were paying the cost.
> (Though some control GUCs might be needed anyway.)

I think the problem is that this most frequently is an issue when
investigating an issue for customers. Often enough it'll legally not be
possible to gain direct access to the system, and remotely instructing
people to install an extension and configure it isn't great.  So if we
could, by default, include something better for PANICs etc, it'd be a
great boon - I think that's even clear from conversations on the pg
lists (which will often be the more knowledgeable people): how often did
we try hard to get a backtrace from a crash?

If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.

Greetings,

Andres Freund


Re: PATCH: backtraces for error messages

From
Robert Haas
Date:
On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund <andres@anarazel.de> wrote:
> I think the problem is that this most frequently is an issue when
> investigating an issue for customers. Often enough it'll legally not be
> possible to gain direct access to the system, and remotely instructing
> people to install an extension and configure it isn't great.  So if we
> could, by default, include something better for PANICs etc, it'd be a
> great boon - I think that's even clear from conversations on the pg
> lists (which will often be the more knowledgeable people): how often did
> we try hard to get a backtrace from a crash?

+1 to all of that.  This is a real nuisance, and making it less of a
nuisance would be great.

> If we instead had a backtrace enabled for all PANICs and some FATALs by
> default (and perhaps a SIGSEGV handler too), we'd be in a better
> place. That'd obviously only work when compiled with support for
> libraries, on platforms where we added support for that. But I think
> that'd be quite the improvement already.

I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
if we can make it safe enough, which I'm not sure about, but then I
suppose we're crashing anyway.  Instead of making the ERROR behavior
conditional on log_error_verbosity as Craig has it now, how about
doing it whenever the error code is ERRCODE_INTERNAL_ERROR?  That's
pretty much the cases that aren't supposed to happen, so if we see
those happening a lot, it's either a bug we need to fix or we should
supply a better error code.  Also, a lot of those messages are
duplicated in many places and/or occur inside fairly generic functions
inside lsyscache.c, so the actual error message text tends not to be
enough to know what happened.
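
In other words, something like this in errfinish() (a sketch; append_backtrace() is a made-up helper name):

    /* Sketch: collect a stack for "shouldn't happen" errors, plus PANIC. */
    if (edata->sqlerrcode == ERRCODE_INTERNAL_ERROR || edata->elevel >= PANIC)
        append_backtrace(edata);    /* made-up helper */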

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PATCH: backtraces for error messages

From
Andres Freund
Date:
On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
> On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > If we instead had a backtrace enabled for all PANICs and some FATALs by
> > default (and perhaps a SIGSEGV handler too), we'd be in a better
> > place. That'd obviously only work when compiled with support for
> > libraries, on platforms where we added support for that. But I think
> > that'd be quite the improvement already.
> 
> I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
> if we can make it safe enough, which I'm not sure about, but then I
> suppose we're crashing anyway.

Yea, I think that's pretty much why it'd be ok.


> Instead of making the ERROR behavior conditional on
> log_error_verbosity as Craig has it now, how about doing it whenever
> the error code is ERRCODE_INTERNAL_ERROR?  That's pretty much the
> cases that aren't supposed to happen, so if we see those happening a
> lot, it's either a bug we need to fix or we should supply a better
> error code.  Also, a lot of those messages are duplicated in many
> places and/or occur inside fairly generic functions inside
> lsyscache.c, so the actual error message text tends not to be enough
> to know what happened.

I don't think that's ok. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where
we'd want this. E.g. a lot of generic filesystem errors have
errcode_for_file_access(), but their messages are too generic to
debug.  So I think we'll just need a separate GUC for things that
aren't PANIC and haven't explicitly opted in.

Greetings,

Andres Freund


Re: PATCH: backtraces for error messages

From
Daniel Gustafsson
Date:
> On 20 Jun 2018, at 17:10, Craig Ringer <craig@2ndquadrant.com> wrote:

> BTW, Álvaro posted a simpler patch at
> https://www.postgresql.org/message-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql. It uses backtrace() from glibc,
> and requires each site you want to bt to be annotated explicitly. I forgot about backtrace() completely when I wrote
> mine, and I prefer the control libunwind gives me anyway, but the reduced dependency would be nice. Especially since
> backtrace() is in FreeBSD too.

Just as a datapoint regarding this: backtrace() is not available in OpenBSD,
but there is a library in ports which implements it, libexecinfo.

cheers ./daniel

Re: PATCH: backtraces for error messages

From
Robert Haas
Date:
On Wed, Jun 20, 2018 at 1:15 PM, Andres Freund <andres@anarazel.de> wrote:
> I don't think that's ok. It's perfectly possible to hit
> ERRCODE_INTERNAL_ERROR at a high frequency in some situations,

Really?  How?

> and
> there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where
> we'd want this. E.g. a lot of generic filesystem errors have
> errcode_for_file_access(), but their messages are too generic to
> debug.  So I think we'll just need a separate GUC for things that
> aren't PANIC and haven't explicitly opted in.

Mmph.  I don't like that much.  I mean I can hold my nose, but I hope
we can find a way to do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: PATCH: backtraces for error messages

From
"Tsunakawa, Takayuki"
Date:
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
> I have no idea how expensive backtrace() and libunwind are, but I doubt
> we want to pay the cost for every message before we know that error
> requires the backtrace to be collected.  Something like PGC_USERSET
>   server_min_backtraces=PANIC
> might be a possible interface.

+1

In addition, it would be nicer if we could have some filtering based on the error condition.  Sometimes I wanted to
know where the "out of memory" or "invalid memory alloc request size ..." was thrown.

How about adding a GUC like the -e option of strace?  strace allows you to specify which system calls and groups of them
to capture.  We could use the SQL state and class code/name to list the conditions to capture.  In the out-of-memory
case, I want to specify 53200, but don't want to get stack traces for other errors in class 53.

https://www.postgresql.org/docs/devel/static/errcodes-appendix.html
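
In elog.c terms the check could be as simple as this (a sketch; append_backtrace() is a made-up helper name):

    /* Sketch: match one specific SQLSTATE (53200 = out_of_memory) rather
     * than the whole class 53; MAKE_SQLSTATE is the existing macro. */
    if (edata->sqlerrcode == MAKE_SQLSTATE('5', '3', '2', '0', '0'))
        append_backtrace(edata);    /* made-up helper */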

Regards
Takayuki Tsunakawa






Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
On 21 June 2018 at 01:15, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
> On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > If we instead had a backtrace enabled for all PANICs and some FATALs by
> > default (and perhaps a SIGSEGV handler too), we'd be in a better
> > place. That'd obviously only work when compiled with support for
> > libraries, on platforms where we added support for that. But I think
> > that'd be quite the improvement already.
>
> I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
> if we can make it safe enough, which I'm not sure about, but then I
> suppose we're crashing anyway.

Yea, I think that's pretty much why it'd be ok.

Yep. At worst we crash again while trying to generate a bt. We're not doing anything particularly exciting, and it should be sensible enough.

> Instead of making the ERROR behavior conditional on
> log_error_verbosity as Craig has it now, how about doing it whenever
> the error code is ERRCODE_INTERNAL_ERROR?  That's pretty much the
> cases that aren't supposed to happen, so if we see those happening a
> lot, it's either a bug we need to fix or we should supply a better
> error code.  Also, a lot of those messages are duplicated in many
> places and/or occur inside fairly generic functions inside
> lsyscache.c, so the actual error message text tends not to be enough
> to know what happened.

I don't think that's ok. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where we'd want
this.

Perhaps we should fix those, but it might be a game of whack-a-mole as the code changes, and inevitably you'll want to generate stacks for some other errcode while getting frustrated at all the ERRCODE_INTERNAL_ERROR. Not sure it's worth it.

However, maybe a GUC like

log_error_stacks_errcodes = 'XX000, 55000'

would work. It'd be somewhat expensive to evaluate, but we'd only be doing it where we'd already decided to emit an error. And it'd fit in even if we later added smarter selective logging.
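
In errfinish() that'd look roughly like the following (sketch; the parsed list and the helper are invented names, with the array filled in by the GUC's assign hook):

    /* Sketch: stack_errcodes[] / n_stack_errcodes would come from parsing
     * the GUC string; append_backtrace() is an invented helper. */
    for (int i = 0; i < n_stack_errcodes; i++)
    {
        if (edata->sqlerrcode == stack_errcodes[i])
        {
            append_backtrace(edata);
            break;
        }
    }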

BTW, it's worth noting that these backtraces are very limited. They don't report arguments or locals. So it's still no substitute for suitable errcontext callbacks, and sometimes it's still necessary to fall back to gdb or messing around with perf userspace tracepoints.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
This is what the stacks look like btw


[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG:  00000: find_in_dynamic_libpath: trying "/home/craig/pg/10/lib/postgresql/pglogical.so"
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:  find_in_dynamic_libpath, dfmgr.c:639
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK: 
    FRAME    0: find_in_dynamic_libpath +628 
    FRAME    1: expand_dynamic_library_name +205 
    FRAME    2: load_external_function +38  
    FRAME    3: LookupBackgroundWorkerFunction +197 
    FRAME    4: StartBackgroundWorker +549 
    FRAME    5: do_start_bgworker +466 
    FRAME    6: maybe_start_bgworkers +382 
    FRAME    7: reaper +895 
    FRAME    8: funlockfile +80  
    FRAME    9: __select +23  
    FRAME   10: ServerLoop +394 
    FRAME   11: PostmasterMain +4499

I wrote it because I got sick of Assert(false) debugging, and I was chasing down some "ERROR:  08P01: insufficient data left in message" errors. Then I got kind of caught up in it... you know how it is.

It also goes to show there are plenty of places you might want to get a stack where you don't have an internal errcode or panic. I don't think that idea will fly.

Re: PATCH: backtraces for error messages

From
Michael Paquier
Date:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.

Yes, I know that feeling!  I have used the Assert(false) and
upgrade-to-PANIC tricks a couple of times as well, so being able to
get backtraces more easily would be really nice.

> It also goes to show there are plenty of places you might want to get a
> stack where you don't have an internal errcode or panic. I don't think that
> idea will fly.

Yep.  Error message uniqueness can help, but having the call stack
can lead to a better understanding of a problem.
--
Michael

Attachment

Re: PATCH: backtraces for error messages

From
Kyotaro HORIGUCHI
Date:
Hello, I basically like this.

At Thu, 21 Jun 2018 12:35:10 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in
<CAMsr+YGFMVnFLRSf09GGgjUyxv3C0ytep9ftM8N2X1kLM-SkKw@mail.gmail.com>
> This is what the stacks look like btw
> 
> 
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG:  00000:
> find_in_dynamic_libpath: trying
> "/home/craig/pg/10/lib/postgresql/pglogical.so"
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
> find_in_dynamic_libpath, dfmgr.c:639
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
>     FRAME    0: find_in_dynamic_libpath +628
>     FRAME    1: expand_dynamic_library_name +205
>     FRAME    2: load_external_function +38
>     FRAME    3: LookupBackgroundWorkerFunction +197
>     FRAME    4: StartBackgroundWorker +549
>     FRAME    5: do_start_bgworker +466
>     FRAME    6: maybe_start_bgworkers +382
>     FRAME    7: reaper +895
>     FRAME    8: funlockfile +80
>     FRAME    9: __select +23
>     FRAME   10: ServerLoop +394
>     FRAME   11: PostmasterMain +4499
> 
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.
> 
> It also goes to show there are plenty of places you might want to get a
> stack where you don't have an internal errcode or panic. I don't think that
> idea will fly.

I think this is fine for assertion failures, but I'm not sure about
the other cases. For known cases we could set a proper context
description or other additional information in the error messages
rather than just dumping a trace.

Since PG 9.5 the RPMs are compiled with --enable-dtrace, so we could
use SystemTap to insert the stack-trace stuff. An additional
built-in probe in the error reporting system or in assertions would
make this pluggable.

However, SystemTap doesn't work for me, and I'm not sure how it is
broken. (stap version is 3.2/0.170 and uname -r shows
3.10.0-862)

$ sudo stap -e  'probe process(".../bin/postgres").mark("transaction__start"){}'
...
LOG:  autovacuum launcher process (PID 10592) was terminated by signal 4: Illegal instruction


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: backtraces for error messages

From
Pavan Deolasee
Date:


On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.

Yes, I know that feeling!  I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.


Sometime back I'd suggested an idea to be able to dynamically manage log levels for elog messages [1]. That did not draw much response, but I still believe some such facility will be very useful for debugging production systems. Now I concede that the POC patch I sent is horribly written and has a bad UI, but those can be improved if there is interest. Given the lack of response, though, I'm not sure there is enough interest in the feature.

Thanks,
Pavan

[1] https://www.postgresql.org/message-id/CABOikdMvx_Kr_b4ELhJEoeGcLTZKrDma%2BfPZpoZVdpL7Zc0bVw%40mail.gmail.com

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:


On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com> wrote:


On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.

Yes, I know that feeling!  I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.


Sometime back I'd suggested an idea to be able to dynamically manage log levels for elog messages [1].


Huge +1 from me on being able to selectively manage logging on a module/subsystem, file, or line level.

I don't think I saw the post.

Such a thing would obviously make built in backtrace support much more useful.

Re: PATCH: backtraces for error messages

From
Andres Freund
Date:
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com>
> wrote:
> 
> >
> >
> > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz>
> > wrote:
> >
> >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> >> > I wrote it because I got sick of Assert(false) debugging, and I was
> >> chasing
> >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> >> Then I
> >> > got kind of caught up in it... you know how it is.
> >>
> >> Yes, I know that feeling!  I have been using as well the Assert(false)
> >> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> >> more easily backtraces would be really nice.
> >>
> >>
> > Sometime back I'd suggested an idea to be able to dynamically manage log
> > levels for elog messages [1].
> >
> 
> 
> Huge +1 from me on being able to selectively manage logging on a
> module/subsystem, file, or line level.
> 
> I don't think I saw the post.
> 
> Such a thing would obviously make built in backtrace support much more
> useful.

I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.

- Andres


Re: PATCH: backtraces for error messages

From
Pavan Deolasee
Date:


On Fri, Jun 22, 2018 at 6:18 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
>
>
> Huge +1 from me on being able to selectively manage logging on a
> module/subsystem, file, or line level.
>
> I don't think I saw the post.
>
> Such a thing would obviously make built in backtrace support much more
> useful.

I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.


Sorry, I did not mean to mix up two patches. I brought it up just in case it provides another idea about when and how to log the backtrace. So yeah, let's discuss that patch separately.

Thanks,
Pavan 

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
On 22 June 2018 at 08:48, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com>
> wrote:
>
> >
> >
> > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz>
> > wrote:
> >
> >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> >> > I wrote it because I got sick of Assert(false) debugging, and I was
> >> chasing
> >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> >> Then I
> >> > got kind of caught up in it... you know how it is.
> >>
> >> Yes, I know that feeling!  I have been using as well the Assert(false)
> >> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> >> more easily backtraces would be really nice.
> >>
> >>
> > Sometime back I'd suggested an idea to be able to dynamically manage log
> > levels for elog messages [1].
> >
>
>
> Huge +1 from me on being able to selectively manage logging on a
> module/subsystem, file, or line level.
>
> I don't think I saw the post.
>
> Such a thing would obviously make built in backtrace support much more
> useful.

I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.


Totally agree, and it's why I raised this as its own thing.

Thanks.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Korry Douglas
Date:
A few misc comments:

In general, +1.

It might be nice to move the stack trace code into a separate function (not static to elog.c) - I have often found it useful to attach backtraces to objects so I can debug complex allocation/deallocation problems.  

The major expense in capturing a stack trace is in the symbolization of the stack addresses, not the stack walk itself.  You typically want to symbolize the addresses at the time you generate the trace, but that's not always the case.  For example, if you want to record the stack trace of every call to AllocSetAlloc() (and attach the trace to the AllocChunk) performance gets unbearable if you symbolize every trace.  So a flag that specifies whether to symbolize would be nice too. 

If you don't symbolize, you need a way to capture the address range of each dynamically loaded shared object (so that you can later symbolize using something like addr2line).  

(The above suggestions relate to server debugging, not application debugging).

End of wish list...

     -- Korry

On Fri, Jun 22, 2018 at 3:07 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 22 June 2018 at 08:48, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com>
> wrote:
>
> >
> >
> > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz>
> > wrote:
> >
> >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> >> > I wrote it because I got sick of Assert(false) debugging, and I was
> >> chasing
> >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> >> Then I
> >> > got kind of caught up in it... you know how it is.
> >>
> >> Yes, I know that feeling!  I have been using as well the Assert(false)
> >> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> >> more easily backtraces would be really nice.
> >>
> >>
> > Sometime back I'd suggested an idea to be able to dynamically manage log
> > levels for elog messages [1].
> >
>
>
> Huge +1 from me on being able to selectively manage logging on a
> module/subsystem, file, or line level.
>
> I don't think I saw the post.
>
> Such a thing would obviously make built in backtrace support much more
> useful.

I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.


Totally agree, and it's why I raised this as its own thing.

Thanks.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
On 22 June 2018 at 23:26, Korry Douglas <korry.douglas@enterprisedb.com> wrote:
A few misc comments:

In general, +1.

It might be nice to move the stack trace code into a separate function (not static to elog.c) - I have often found it useful to attach backtraces to objects so I can debug complex allocation/deallocation problems.  

Good suggestion.
 

The major expense in capturing a stack trace is in the symbolization of the stack addresses, not the stack walk itself.  You typically want to symbolize the addresses at the time you generate the trace, but that's not always the case.  For example, if you want to record the stack trace of every call to AllocSetAlloc() (and attach the trace to the AllocChunk) performance gets unbearable if you symbolize every trace.  So a flag that specifies whether to symbolize would be nice too. 

libunwind has some nifty caching that makes that a _lot_ cheaper; that's part of why I went with it despite the extra lib requirement.

 
If you don't symbolize, you need a way to capture the address range of each dynamically loaded shared object (so that you can later symbolize using something like addr2line).

Yeah. libunwind doesn't expose any interface to get that information, so you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or capture /proc/self/maps. You have to make sure that info makes it to the log, and is re-output often enough not to be lost to log rotation, and is invalidated and re-output if mappings change due to new lib loads etc. But you don't want to print it all the time either.

Then you have to walk the stack and print the instruction pointers and stack pointers and spit out raw traces for the user to reassemble.

I'd argue that if you're doing the sort of thing where you want a stack of every AllocSetAlloc, you shouldn't be trying to do that in-process. That's where tools like perf, systemtap, etc really come into their own. I'm focused on making additional diagnostic info for errors and key log messages collectable for systems that aren't easily accessed, like users who have 12-hour read/response latencies and security setups as strict as they are insane and nonsensical.

I'd have no objection to the option to disable symbolic back traces and print the raw call stack. It's trivial to do; in fact, I only removed the ip/sp info to keep the output more concise.
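
Per frame that's just something like this inside the unw_step() loop (sketch):

    /* Sketch: raw instruction/stack pointers instead of symbol names;
     * this goes inside the unw_step() loop. */
    unw_word_t  ip;
    unw_word_t  sp;

    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    fprintf(stderr, "    FRAME %4d: ip=%#lx sp=%#lx\n",
            frame, (unsigned long) ip, (unsigned long) sp);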

I'm not interested in writing anything to handle the library load address mapping collection etc though. I don't see a really sensible way to do that in a log-based system.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
On 23 June 2018 at 20:22, Craig Ringer <craig@2ndquadrant.com> wrote:
 
Yeah. libunwind doesn't expose any interface to get that information, so you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or capture /proc/self/maps.

Ahem, I take that part back.


see /usr/include/link.h
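
i.e. dl_iterate_phdr, roughly (sketch only):

    #define _GNU_SOURCE
    #include <link.h>
    #include <stdio.h>

    /* Sketch: log each loaded object's base address so raw frame
     * addresses can be symbolized later with addr2line or similar. */
    static int
    dump_mapping(struct dl_phdr_info *info, size_t size, void *data)
    {
        fprintf(stderr, "mapping: \"%s\" at 0x%lx\n",
                info->dlpi_name[0] ? info->dlpi_name : "[main executable]",
                (unsigned long) info->dlpi_addr);
        return 0;               /* keep iterating */
    }

    /* ... dl_iterate_phdr(dump_mapping, NULL); ... */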



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
On 21 June 2018 at 19:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I think this for assertion failure is no problem but I'm not sure
for other cases.

I think it's pretty strongly desirable for PANIC.
 
We could set proper context description or other
additional information in error messages before just dumping a
trace for known cases.

Yeah. The trouble there is that there are a _lot_ of places to touch for such things, and inevitably the one you really want to see will be something that didn't get suitably annotated.
 
Since PG 9.5 the RPMs are compiled with --enable-dtrace, so we could
use system tap to insert the stack-trace stuff. Additional
built-in probe in error reporting system or assertions would make
this pluggable.

I don't bother with SystemTAP anymore; perf does the job for most purposes.

You can use --enable-dtrace SDTs with Perf fine. I wrote it up for Pg at https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events .

Dynamic tracepoints are also very useful.

Both require debuginfo, but so does outputting of symbolic stacks per my patch.

(That reminds me, I need to chat with Devrim about creating a longer-lived debuginfo + old versions rpms repo for Pg itself, if not the accessory bits and pieces. I'm so constantly frustrated by not being able to get needed debuginfo packages to investigate some core or running system problem because they've been purged from the PGDG yum repo as soon as a new version comes out.)

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Kyotaro HORIGUCHI
Date:
Hi.

At Mon, 25 Jun 2018 09:32:36 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in
<CAMsr+YGBw9tgKRGxyihVeMzmjQx_2t8D17tE7t5-0gMdW7S6UA@mail.gmail.com>
> On 21 June 2018 at 19:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp
> > wrote:
> 
> I think this for assertion failure is no problem but I'm not sure
> > for other cases.
> 
> 
> I think it's pretty strongly desirable for PANIC.

Ah, I forgot about that, and I agree. The cost of collecting the
information is not a problem on PANIC. Still, I don't think a stack
trace is valuable for all PANIC messages. I can accept a GUC to
control it, but it would be preferable for this to work without such
an extra setup.

> > We could set proper context description or other
> > additional information in error messages before just dumping a
> > trace for known cases.
> >
> 
> Yeah. The trouble there is that there are a _lot_ of places to touch for
> such things, and inevitably the one you really want to see will be
> something that didn't get suitably annotated.

Agreed, that is the reality.  Instead, can't we make new error
classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly restrict
stack dumps to specific elog() calls?  Or elog_stackdump() versus
elog() would also be fine for me. Using those is easier than proper
annotating. It would be perfect if we could invent an automated way,
but I don't think that is realistic.

> > Since PG9.5 RPMs are compiled with --enable-dtrace so we could
> > use system tap to insert the stack-trace stuff. Additional
> > built-in probe in error reporting system or assertions would make
> > this pluggable.
> >
> 
> I don't bother with SystemTAP anymore; perf does the job for most purposes.
> 
> You can use --enable-dtrace SDTs with Perf fine. I wrote it up for Pg at
> https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
> .
> 
> Dynamic tracepoints are also very useful.
> 
> Both require debuginfo, but so does outputting of symbolic stacks per my
> patch.

Mmm. What I meant is that perf doesn't dump a backtrace on a probe
point, but tracepoints are usable to take a symbolic backtrace. (I'm
sorry that I cannot provide an example since stap doesn't work in my
box..)

If your intention is to take back traces without any setting (I
think it is preferable), it should be restricted to the required
points. It can be achieved by the additional error classes or
substitute error output functions.

Just an idea, but couldn't we use a definition file in which the
LOCATIONs of error messages that need to dump a backtrace are
listed? That list would usually be empty and should be very short if
there are any entries at all. The LOCATION information is easily
obtained from a verbose error message once it has been shown, but it
is a bit hard to find otherwise..

> (That reminds me, I need to chat with Devrim about creating a longer lived
> debuginfo + old versions rpms repo for Pg itself, if not the accessory
> bits and pieces. I'm so constantly frustrated by not being able to get
> needed debuginfo packages to investigate some core or running system
> problem because they've been purged from the PGDG yum repo as soon as a new
> version comes out.)

In our department we take care to preserve them ourselves, since we
need them to support older systems. I sometimes feel it would be
very helpful if they were available on the official site.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: backtraces for error messages

From
Craig Ringer
Date:
On 25 June 2018 at 14:21, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hi.

At Mon, 25 Jun 2018 09:32:36 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YGBw9tgKRGxyihVeMzmjQx_2t8D17tE7t5-0gMdW7S6UA@mail.gmail.com>
> On 21 June 2018 at 19:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp
> > wrote:
>
> I think this for assertion failure is no problem but I'm not sure
> > for other cases.
>
>
> I think it's pretty strongly desirable for PANIC.

Ah, I forgot about that. I agree to that. The cost to collect the
information is not a problem on PANIC. However still I don't
think stack trace is valuable on all PANIC messages. I can accept
the guc to control it but it is preferable that this works fine
without such an extra setup.

Places such as?
 

> > We could set proper context description or other
> > additional information in error messages before just dumping a
> > trace for known cases.
> >
>
> Yeah. The trouble there is that there are a _lot_ of places to touch for
> such things, and inevitably the one you really want to see will be
> something that didn't get suitably annotated.

Agreed, that is the reality.  Instead, can't we make new error
classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly restrict
stack dumps to specific elog() calls?  Or elog_stackdump() versus
elog() would also be fine for me. Using those is easier than proper
annotating. It would be perfect if we could invent an automated way,
but I don't think that is realistic.

That needlessly complicates error severity levels with information not really related to the severity. -1 from me.
 
Mmm. If I understand you correctly, I mean that perf doesn't dump
a backtrace on a probe point but trace points are usable to take
a symbolic backtrace. (I'm sorry that I cannot provide an example
since stap doesn't work in my box..)

perf record --call-graph dwarf -e sdt_postgresql:checkpoint__start -u postgres 
perf report -g
 
If your intention is to take back traces without any setting (I
think it is preferable), it should be restricted to the required
points. It can be achieved by the additional error classes or
substitute error output functions.

Who's classifying all the possible points?

Which PANICs or assertion failures do you want to exempt?

I definitely do not want to emit stacks for everything, like my patch currently does. It's just a proof of concept. Later on I'll want fine-grained runtime control over when that happens, but that's out of scope for this. For now the goal is to emit stacks at times it's obviously pretty sensible to have a stack, and to do it in a way that doesn't require per-error-site maintenance/changes or create backport hassle.
 
Just an idea, but couldn't we use a definition file in which the
LOCATIONs of error messages that need to dump a backtrace are
listed? That list would usually be empty and should be very short if
there are any entries at all. The LOCATION information is easily
obtained from a verbose error message once it has been shown, but it
is a bit hard to find otherwise..

That's again veering into selective logging control territory. Rather than doing it for stack dump control only, it should be part of a broader control over dynamic and scoped verbosity, selective logging, and log options, like Pavan raised. I see stacks as just one knob that can be turned on/off here.

> (That reminds me, I need to chat with Devrim about creating a longer lived
> debuginfo + old versions rpms repo for Pg its self, if not the accessory
> bits and pieces. I'm so constantly frustrated by not being able to get
> needed debuginfo packages to investigate some core or running system
> problem because they've been purged from the PGDG yum repo as soon as a new
> version comes out.)

We in our department take care to preserve them for ourselves for
the necessity of supporting older systems. I sometimes feel that
It is very helpful if they were available on the official 

Maybe if I can get some interest in that, you might be willing to contribute your archives as a starter so we have them for back-versions?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: backtraces for error messages

From
Korry Douglas
Date:


On Sat, Jun 23, 2018 at 8:22 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 22 June 2018 at 23:26, Korry Douglas <korry.douglas@enterprisedb.com> wrote:
A few misc comments:

In general, +1.

It might be nice to move the stack trace code into a separate function (not static to elog.c) - I have often found it useful to attach backtraces to objects so I can debug complex allocation/deallocation problems.  

Good suggestion.
 

The major expense in capturing a stack trace is in the symbolization of the stack addresses, not the stack walk itself.  You typically want to symbolize the addresses at the time you generate the trace, but that's not always the case.  For example, if you want to record the stack trace of every call to AllocSetAlloc() (and attach the trace to the AllocChunk) performance gets unbearable if you symbolize every trace.  So a flag that specifies whether to symbolize would be nice too. 

libunwind has some nifty caching that makes that a _lot_ cheaper; that's part of why I went with it despite the extra lib requirement.

I've not used libunwind before - looking through the docs now.  It does seem much more flexible than backtrace(), thanks.
 
 
If you don't symbolize, you need a way to capture the address range of each dynamically loaded shared object (so that you can later symbolize using something like addr2line).

Yeah. libunwind doesn't expose any interface to get that information, so you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or capture /proc/self/maps. You have to make sure that info makes it to the log, and is re-output often enough not to be lost to log rotation, and is invalidated and re-output if mappings change due to new lib loads etc. But you don't want to print it all the time either.

Then you have to walk the stack and print the instruction pointers and stack pointers and spit out raw traces for the user to reassemble.

I'd argue that if you're doing the sort of thing where you want a stack of every AllocSetAlloc, you shouldn't be trying to do that in-process. That's where tools like perf, systemtap, etc really come into their own. I'm focused on making additional diagnostic info for errors and key log messages collectable for systems that aren't easily accessed, like users who have 12-hour read/response latencies and security setups as strict as they are insane and nonsensical.

Understood - I realized after I replied that instrumenting AllocSetAlloc() is a bit more complex than I had suggested.  When I need to capture the call stack of each allocation, I store the stack trace in malloc'ed buffers; otherwise it recurses back into AllocSetAlloc().  I suspect that's unreasonable for a general-purpose solution.

I'd have no objection to the option to disable symbolic back traces and print the raw call stack. It's trivial to do; in fact, I only removed the ip/sp info to keep the output more concise.

I'm not interested in writing anything to handle the library load address mapping collection etc though. I don't see a really sensible way to do that in a log-based system.

Agreed - I tend to use (saved, not logged) stack traces to find memory leaks that valgrind can't see (or more precisely, excessive resource consumption that is not actually leaked).

Thanks for your consideration.

     -- Korry

Re: PATCH: backtraces for error messages

From
Kyotaro HORIGUCHI
Date:
Hello. Thanks for the discussion.

At Mon, 25 Jun 2018 17:08:41 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in
<CAMsr+YGdJTGygmGhRhfn+8DLi6o+Teq+tcA-Dr3kK+8vYqwzCA@mail.gmail.com>
> On 25 June 2018 at 14:21, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp
> > > I think it's pretty strongly desirable for PANIC.
> >
> > Ah, I forgot about that. I agree to that. The cost to collect the
> > information is not a problem on PANIC. However still I don't
> > think stack trace is valuable on all PANIC messages. I can accept
> > the guc to control it but it is preferable that this works fine
> > without such an extra setup.
> >
> 
> Places such as?

Sorry but I didn't get this.

> > Agreed, it is the reality.  Instead, can't we make a new error
> > classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
> > restrict stack dump for elog()?  Or elog_stackdump() and elog()
> > is also fine for me. Using it is easier than proper
> > annotating. It would be perfect if we could invent an automated
> > way but I don't think it is realistic.
> >
> 
> That needlessly complicates error severity levels with information not
> really related to the severity. -1 from me.

Ok.

> > Mmm. If I understand you correctly, I mean that perf doesn't dump
> > a backtrace on a probe point but trace points are usable to take
> > a symbolic backtrace. (I'm sorry that I cannot provide an example
> > since stap doesn't work in my box..)
> >
> 
> perf record --call-graph dwarf -e sdt_postgresql:checkpoint__start -u
> postgres
> perf report -g

Good to know, but unfortunately it doesn't work for me. USDT support
is still on the perf wiki's TODO list. Some googling told me that it
is available on Linux 4.4; CentOS 7's kernel is 3.10. It seems to be
a cutting-edge feature. However, I suppose that what it emits is
different from what is wanted here.

Regardless of that, I agree that it is a bit hard to use
on-site. (stap doesn't work for me for some uncertain reason..)

> > If your intention is to take back traces without any setting (I
> > think it is preferable), it should be restricted to the required
> > points. It can be achieved by the additional error classes or
> > substitute error output functions.
> >
> 
> Who's classifying all the possible points?

I didn't mean classifying all the points. I meant that covering
only the points where it is needed would be enough for a first step.

> Which PANICs or assertion failures do you want to exempt?

I didn't intend to cover all required cases. I intended to cover
just a few obviously valuable cases, where we gave up on adding
adequate context information for some reason and a backtrace gives
valuable information. Most xlog stuff wouldn't need it. Lock stuff
might need it. Heapam doesn't seem to need it, since those errors
seem to be caused by calling order. And so on... But such control is
outside this patch, I know.

> I definitely do not want to emit stacks for everything, like my patch
> currently does. It's just a proof of concept. Later on I'll want control on
> a fine grained level at runtime of when that happens, but that's out of
> scope for this. For now the goal is to emit stacks at times it's obviously
> pretty sensible to have a stack, and do it in a way that doesn't require
> per-error-site maintenance/changes or create backport hassle.

Ok. The PoC is focused on the means of emitting backtraces, and
controlling logging is a separate issue, as Andres said
upthread.

I think that backtrace() generates somewhat strange-looking
output and we can get cleaner output using libunwind. So I vote for
libunwind to implement this. The cost is not an issue as long as we
only emit this for Asserts or PANICs. Choosing some
ERRORs by a GUC is also fine with me.

> > As just an idea but can't we use a definition file on that
> > LOCATION of error messages that needs to dump a backtrace are
> > listed? That list is usually empty and should be very short if
> > any. The LOCATION information is easily obtained from a verbose
> > error message itself if once shown but a bit hard to find
> > otherwise..
> >
> 
> That's again veering into selective logging control territory. Rather than
> doing it for stack dump control only, it should be part of a broader
> control over dynamic and scoped verbosity, selective logging, and log
> options, like Pavan raised. I see stacks as just one knob that can be
> turned on/off here.
> 
> > (That reminds me, I need to chat with Devrim about creating a longer lived
> > > debuginfo + old versions rpms repo for Pg itself, if not the accessory
> > > bits and pieces. I'm so constantly frustrated by not being able to get
> > > needed debuginfo packages to investigate some core or running system
> > > problem because they've been purged from the PGDG yum repo as soon as a
> > new
> > > version comes out.)
> >
> > We in our department take care to preserve them for ourselves for
> > the necessity of supporting older systems. I sometimes feel that
> > It is very helpful if they were available on the official
> 
> 
> Maybe if I can get some interest in that, you might be willing to
> contribute your archives as a starter so we have them for back-versions?

Unfortunately I think we cannot do that, for several reasons. Apart
from those reasons, the official site removes older versions as a
matter of policy, and I'm not sure whether working around that
elsewhere is a good idea.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center