Re: AssertLog instead of Assert in some places - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: AssertLog instead of Assert in some places
Msg-id CAH2-Wzmqs8BAweWoneYJzSjFDci=3UTr+dAg2WrPV-vP5DHF3Q@mail.gmail.com
In response to Re: AssertLog instead of Assert in some places  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Aug 11, 2023 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> Where supported, a crash (distinguishing from assertion failures) will now
> report something like:
>
> process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
>         [0x5628ec45212f] pg_fatalsig_handler+0x20f: ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
>         [0x7fc4b743650f] [unknown]
>         [0x5628ec14897c] check_root+0x19c (inlined): ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
>         [0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183
>
> after I added
>         *(volatile int*)0xdeadbeef = 1;
> to check_root().

It'll be like living in the future!

> For signals sent by users, it'd show the pid of the process sending the signal
> on most OSs. I really would like some generalized infrastructure for that, so
> that we can report for things like query cancellations.

That sounds great.

> > It would also be great if the tests spit out information about
> > assertion failures that was reasonably complete (backtrace without any
> > mangling, query text included, other basic context), reliably and
> > uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> > SQL scripts.
>
> Hm. What other basic context are you thinking of? Pid is obvious. I guess
> backend type could be useful too, but that can normally be inferred from the
> stack trace pretty easily. Application name could be useful to know which test
> caused the crash.
>
> I'm a bit wary about trying to print things like query text - what if that
> string points to memory not terminated by a \0? I guess we could use an
> approach similar to pgstat_get_crashed_backend_activity().

I agree that being less verbose by default is good. On second thought
even query text isn't all that important.

> One issue with reporting crashes from signal handlers is that the obvious way
> to make that signal safe (lots of small writes) leads to the potential for
> interspersed lines.  It's probably worth having a statically sized buffer that
> will commonly be large enough to print a whole backtrace. When too small, it
> should include the pid at the start of every "chunk".

Good idea.

> > Which kind of test happened to be involved is usually not interesting to me
> > here (even the query text won't usually be interesting), so being forced to
> > think about it slows me down quite a lot.
>
> Interesting - I quite often end up spending time trying to dig out which query
> from what sql file triggered a crash, so I can try to trigger it in
> isolation. I often wished the server knew the source line associated with the
> query. Enough that I pondered ways to have psql transport that knowledge to
> the server.

I actually do plenty of that too. My overall point was this: there is
likely some kind of Pareto principle here. That should guide the sorts
of scenarios we optimize for.

If you actually benchmarked where I spent time while writing code,
minute to minute, I bet it would show that most of the individual
debug-compile cycles were triggered by issues that had a fairly simple
and immediate cause. Cases where I improve one small thing, and then
rerun the tests, which show an assertion failure in nearby code. As
soon as I see very basic details I immediately think "duh, of course"
in these cases, at which point I'll come up with a likely-good fix in
seconds. And then I'll rinse and repeat. My fix might just work (at
least to the extent that all tests now pass), but it also might lead
to another assertion failure of the same general nature.

There are also lots of cases where I really do have to think about
recreating the details from the test in order to truly understand
what's going on, of course. But there are still way way more
individual "duh, of course" assertion failures in practice. Those are
where productivity wins are still possible, because the bottleneck
isn't just that I have an incomplete mental model that I need to work
to expand.

Perhaps my experiences here aren't universal. But it seems like they
might be roughly the same as those of everybody else who works on Postgres?
Assuming that they are, then the information that is output should be
optimized for the "duh, of course" scenarios. Not to an absurd degree,
mind you. But the output shouldn't be too verbose. Ideally there'd be
a still fairly straightforward way of getting extra information, for
the cases where debugging is likely to take a few minutes, and require
real focus. The extra work in those other cases is relatively
insignificant, because the "startup costs" are relatively large -- a
little extra indirection (though only a little) can't hurt too much.

--
Peter Geoghegan


