Thread: AssertLog instead of Assert in some places
Hi All,

PostgreSQL code uses Assert() as a way to

1. document the assumptions or conditions which must be true at a given place in the code,
2. make sure that some bug elsewhere does not break these assumptions or rules,
3. check conditions which cannot be (easily) induced by user actions.

E.g. the following conditions in adjust_appendrel_attrs():

    /* If there's nothing to adjust, don't call this function. */
    Assert(nappinfos >= 1 && appinfos != NULL);

    /* Should never be translating a Query tree. */
    Assert(node == NULL || !IsA(node, Query));

These conditions do not make it into non-assert builds and thus do not make it into the production binary. That saves some CPU cycles.

When an Assert fails, and it fails quite a lot for developers, the Postgres backend that tripped the Assert is aborted, restarting the server. So a developer testing code that tripped the Assert has to start psql again, run any session setup before re-running the faulting query, and reattach gdb to the new backend process. That's a waste of time and a source of frustration, esp. when the Assert does not damage the backend itself, e.g. by corrupting memory.

Most of the Asserts are recoverable by rolling back the transaction without crashing the backend, so an elog(ERROR, ...) is enough. But elogs by themselves are compiled into the non-debug binary, and the condition check can waste CPU cycles, esp. when the conditions are almost always true, like the ones we use in Assert.

The attached patch combines Assert and elog(ERROR, ...) so that when an Assert is triggered in an assert-enabled binary, it throws an error while keeping the backend intact. Thus it does not affect the gdb session or the psql session. These elogs do not make their way into a non-assert binary, so, like Assert, they do not make it to production.

I have been using AssertLog for my work for some time. It is quite convenient. With AssertLog I get

```
#explain (summary on) select * from a, b, c where a.c1 = b.c1 and b.c1 = c.c1 and a.c2 < b.c2 and a.c3 + b.c3 < c.c3;
ERROR: failed Assert("equal(child_rinfo, adjust_appendrel_attrs(root, (Node *) rinfo, nappinfos, appinfos))"), File: "relnode.c", Line: 997, PID: 568088
```

instead of

```
#explain (summary on) select * from a, b, c where a.c1 = b.c1 and b.c1 = c.c1 and a.c2 < b.c2 and a.c3 + b.c3 < c.c3;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
@!#\q
```

If there is interest in accepting the patch, I will add it to the commitfest and work on it further.

--
Best Wishes,
Ashutosh Bapat
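For readers who don't open the attachment, here is a minimal sketch of the kind of macro being described. This is an illustration of the behavior above, not the patch itself; it assumes backend headers such as postgres.h and miscadmin.h for elog() and MyProcPid.

```c
/*
 * Illustrative sketch only -- the attached patch may differ.  In an
 * assert-enabled build a failed check raises a regular ERROR (the transaction
 * is rolled back, the backend survives); in a non-assert build the check is
 * compiled out entirely, just like Assert().  Assumes postgres.h and
 * miscadmin.h are included, as in any backend .c file.
 */
#ifdef USE_ASSERT_CHECKING
#define AssertLog(condition) \
	do { \
		if (!(condition)) \
			elog(ERROR, "failed Assert(\"%s\"), File: \"%s\", Line: %d, PID: %d", \
				 #condition, __FILE__, __LINE__, MyProcPid); \
	} while (0)
#else
#define AssertLog(condition)	((void) true)
#endif
```

Call sites then look exactly like Assert() calls, e.g. AssertLog(nappinfos >= 1 && appinfos != NULL), and the stringified condition becomes the error text seen in the psql output above.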
Hi,

On 2023-08-11 17:59:37 +0530, Ashutosh Bapat wrote:
> Most of the Asserts are recoverable by rolling back the transaction
> without crashing the backend. So an elog(ERROR, ) is enough. But just
> by themselves elogs are compiled into non-debug binary and the
> condition check can waste CPU cycles esp. conditions are mostly true
> like the ones we use in Assert.
>
> Attached patch combines Assert and elog(ERROR, ) so that when an
> Assert is triggered in assert-enabled binary, it will throw an error
> while keeping the backend intact. Thus it does not affect gdb session
> or psql session. These elogs do not make their way to non-assert
> binary so do not make it to production like Assert.

I am quite strongly against this. This will lead to assertions being hit in
tests without that being noticed, e.g. because they happen in a background
process that just restarts.

Greetings,

Andres Freund
On Fri, Aug 11, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> I am quite strongly against this. This will lead to assertions being hit in
> tests without that being noticed, e.g. because they happen in a background
> process that just restarts.

Couldn't you say the same thing about defensive "can't happen" ERRORs? They
are essentially a form of assertion that isn't limited to assert-enabled
builds.

I have sometimes thought that it would be handy if there was a variant of
"can't happen" ERRORs that took on the structure of an assertion. (This is
quite different to what Ashutosh has proposed, though, since it would still
look like a conventional assertion failure on assert-enabled builds.)

--
Peter Geoghegan
Hi,

On 2023-08-11 11:14:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > I am quite strongly against this. This will lead to assertions being hit in
> > tests without that being noticed, e.g. because they happen in a background
> > process that just restarts.
>
> Couldn't you say the same thing about defensive "can't happen" ERRORs?
> They are essentially a form of assertion that isn't limited to
> assert-enabled builds.

Yes. A lot of them I hate with the passion of a thousand suns ;). "Oh, our
transaction state machinery is confused. Yes, let's just continue going
through the same machinery again, that'll resolve it."

Even worse are the ones that are just WARNINGs. Like "Oh, something wrote
beyond the length of allocated memory, that's something to issue a WARNING
about and happily continue".

There've been people arguing in the past that it's good for robustness to do
stuff like that. I think it's exactly the opposite.

Now, don't get me wrong, there are plenty of cases where a "this shouldn't
happen" elog(ERROR) is appropriate...

> I have sometimes thought that it would be handy if there was a variant
> of "can't happen" ERRORs that took on the structure of an assertion.

What are you imagining? Basically something that generates an elog(ERROR)
with the stringified expression as part of the error message?

Greetings,

Andres Freund
On Fri, Aug 11, 2023 at 11:23 AM Andres Freund <andres@anarazel.de> wrote:
> > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > They are essentially a form of assertion that isn't limited to
> > assert-enabled builds.
>
> Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> our transaction state machinery is confused. Yes, let's just continue going
> through the same machinery again, that'll resolve it.".

I am not unsympathetic to Ashutosh's point about conventional ERRORs being
easier to deal with when debugging your own code, during initial development
work. But that seems like a problem with the tooling in other areas.

For example, dealing with core dumps left behind by the regression tests can
be annoying. Don't you also hate it when there's a regression.diffs that just
shows 20k lines of subtractions? Perhaps you don't -- perhaps your custom
setup makes it quick and easy to get relevant information about what actually
went wrong. But it seems like that sort of thing could be easier to deal with
by default, without using custom shell scripts or anything -- particularly
for those of us that haven't been Postgres hackers for eons.

> There've been people arguing in the past that it's good for robustness to do
> stuff like that. I think it's exactly the opposite.
>
> Now, don't get me wrong, there are plenty cases where a "this shouldn't
> happen" elog(ERROR) is appropriate...

Right. They're not bad -- they just need to be backed up by some kind of
reasoning, which will be particular to each case. The default approach should
be to crash whenever any invariants are violated, because all bets are off at
that point.

> What are you imagining? Basically something that generates an elog(ERROR) with
> the stringified expression as part of the error message?

I'd probably start with a new elevel, that implied an assertion failure in
assert-enabled builds but otherwise acted just like ERROR. I remember
multiple cases where I added an "Assert(false)" right after a "can't happen"
error, which isn't a great approach.

It might also be useful to offer something along the lines you've described,
which I was sort of thinking of myself. But now that I've thought about it a
little more, I think that such an approach might end up being overused. If
you're going to add what amounts to a "can't happen" ERROR then you should
really be obligated to write a halfway reasonable error message. As I said,
you should have to own the fact that you think it's better to not just crash
for this one particular callsite, based on some fairly specific chain of
reasoning. Ideally it'll be backed up by practical experience/user reports.

--
Peter Geoghegan
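A rough sketch of the semantics Peter appears to be describing, written as a macro rather than as a true elevel for brevity. The name and mechanism are invented purely for illustration; nothing like this exists in PostgreSQL today, and it assumes backend headers as usual.

```c
/*
 * Hypothetical sketch: a "can't happen" report that behaves like an assertion
 * failure in assert-enabled builds (so tests notice it immediately) but acts
 * like a plain ERROR in production builds.  The WARNING carries the message
 * into the log before the abort; in production the same message is simply
 * the ERROR text.
 */
#ifdef USE_ASSERT_CHECKING
#define CANT_HAPPEN_ERROR(...) \
	do { \
		elog(WARNING, __VA_ARGS__); \
		Assert(false); \
	} while (0)
#else
#define CANT_HAPPEN_ERROR(...) elog(ERROR, __VA_ARGS__)
#endif

/* Example call site, replacing the elog(ERROR) + Assert(false) pattern: */
/* CANT_HAPPEN_ERROR("unrecognized state %d in foo_state_machine", (int) state); */
```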
Hi,

On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 11:23 AM Andres Freund <andres@anarazel.de> wrote:
> > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > They are essentially a form of assertion that isn't limited to
> > > assert-enabled builds.
> >
> > Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> > our transaction state machinery is confused. Yes, let's just continue going
> > through the same machinery again, that'll resolve it.".
>
> I am not unsympathetic to Ashutosh's point about conventional ERRORs
> being easier to deal with when debugging your own code, during initial
> development work.

Oh, I am as well - I just don't think it's a good idea to introduce "log + error"
assertions to core postgres, because it seems very likely that they'll end up
getting used a lot.

> But that seems like a problem with the tooling in other areas.

Agreed.

> For example, dealing with core dumps left behind by the regression
> tests can be annoying.

Hm. I don't have a significant problem with that. But I can see it being
problematic. Unfortunately, short of preventing core dumps from happening,
I don't think we really can do much about that - whatever is running the
tests shouldn't have privileges to change system wide settings about where
core dumps end up etc.

> Don't you also hate it when there's a regression.diffs that just shows 20k
> lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> it quick and easy to get relevant information about what actually went
> wrong.

I do really hate that. At the very least we should switch to using
restart-after-crash by default, and not start new tests once the server has
crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
if the reason the test failed is that the backend died.

> But it seems like that sort of thing could be easier to deal with by
> default, without using custom shell scripts or anything -- particularly for
> those of us that haven't been Postgres hackers for eons.

Yes, wholeheartedly agreed.

Greetings,

Andres Freund
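The waitpid(postmaster, WNOHANG) idea above, sketched as a hypothetical helper for the test driver. It assumes the driver started the postmaster itself and remembered its pid, which is roughly what pg_regress --temp-instance does; none of this is in the current code.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Sketch only: after a test script fails, check whether the postmaster we
 * started has exited.  If it has, the failure is almost certainly fallout
 * from a crash, and running further tests against it is pointless.
 */
bool
postmaster_has_exited(pid_t postmaster_pid)
{
	int			status;
	pid_t		rc = waitpid(postmaster_pid, &status, WNOHANG);

	/* 0 = still running, postmaster_pid = reaped (it exited), -1 = error */
	return rc == postmaster_pid;
}
```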
On Fri, Aug 11, 2023 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > For example, dealing with core dumps left behind by the regression
> > tests can be annoying.
>
> Hm. I don't have a significant problem with that. But I can see it being
> problematic. Unfortunately, short of preventing core dumps from happening,
> I don't think we really can do much about that - whatever is running the tests
> shouldn't have privileges to change system wide settings about where core
> dumps end up etc.

I was unclear. I wasn't talking about managing core dumps. I was talking
about using core dumps to get a simple backtrace, that just gives me some
very basic information. I probably shouldn't have even mentioned core dumps,
because what I'm really concerned about is the workflow around getting very
basic information about assertion failures. Not around core dumps per se.

The inconsistent approach needed to get a simple, useful backtrace for
assertion failures (along with other basic information associated with the
failure) is a real problem. Particularly when running the tests. Most
individual assertion failures that I see are for code that I'm practically
editing in real time. So shaving cycles here really matters.

For one thing the symbol mangling that we have around the built-in backtraces
makes them significantly less useful. I really hope that your libbacktrace
patch gets committed soon, since that looks like it would be a nice quality
of life improvement, all on its own.

It would also be great if the tests spit out information about assertion
failures that was reasonably complete (backtrace without any mangling, query
text included, other basic context), reliably and uniformly -- it shouldn't
matter if it's from TAP or pg_regress test SQL scripts. Which kind of test
happened to be involved is usually not interesting to me here (even the query
text won't usually be interesting), so being forced to think about it slows
me down quite a lot.

> > Don't you also hate it when there's a regression.diffs that just shows 20k
> > lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> > it quick and easy to get relevant information about what actually went
> > wrong.
>
> I do really hate that. At the very least we should switch to using
> restart-after-crash by default, and not start new tests once the server has
> crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
> if the reason the test failed is that the backend died.

+1

--
Peter Geoghegan
Hi,

On 2023-08-11 13:19:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > For example, dealing with core dumps left behind by the regression
> > > tests can be annoying.
> >
> > Hm. I don't have a significant problem with that. But I can see it being
> > problematic. Unfortunately, short of preventing core dumps from happening,
> > I don't think we really can do much about that - whatever is running the tests
> > shouldn't have privileges to change system wide settings about where core
> > dumps end up etc.
>
> I was unclear. I wasn't talking about managing core dumps. I was
> talking about using core dumps to get a simple backtrace, that just
> gives me some very basic information. I probably shouldn't have even
> mentioned core dumps, because what I'm really concerned about is the
> workflow around getting very basic information about assertion
> failures. Not around core dumps per se.
>
> The inconsistent approach needed to get a simple, useful backtrace for
> assertion failures (along with other basic information associated with
> the failure) is a real problem. Particularly when running the tests.
> Most individual assertion failures that I see are for code that I'm
> practically editing in real time. So shaving cycles here really
> matters.

Ah, yes. Agreed, obviously.

> For one thing the symbol mangling that we have around the built-in
> backtraces makes them significantly less useful. I really hope that
> your libbacktrace patch gets committed soon, since that looks like it
> would be a nice quality of life improvement, all on its own.

I've been hacking further on it:
https://github.com/anarazel/postgres/tree/backtrace

Haven't yet posted a new version. Doing it properly for frontend tools has
some dependencies on threading stuff. I'm hoping Thomas' patch for that will
go in.

Now it also intercepts segfaults and prints backtraces for them, if that's
possible (it requires libbacktrace to be async signal safe, which it isn't on
all platforms).

Where supported, a crash (distinguished from assertion failures) will now
report something like:

process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
[0x5628ec45212f] pg_fatalsig_handler+0x20f: ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
[0x7fc4b743650f] [unknown]
[0x5628ec14897c] check_root+0x19c (inlined): ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
[0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183

after I added
    *(volatile int*)0xdeadbeef = 1;
to check_root().

For signals sent by users, it'd show the pid of the process sending the
signal on most OSs. I really would like some generalized infrastructure for
that, so that we can report for things like query cancellations.

As the patch stands, the only OS that doesn't yet have useful "backtrace on
crash" support with that is Windows, as libbacktrace unfortunately isn't
signal safe on Windows. But it'd still provide useful backtraces on
Assert()s. I haven't yet figured out whether/when it's required to be signal
safe on Windows though - crashes are intercepted by
SetUnhandledExceptionFilter() - I don't understand the precise constraints of
that. Plenty of people seem to generate backtraces on crashes on Windows
using that, without concerns for signal-safety-like things.
Currently macOS CI doesn't use libbacktrace, but as it turns out
backtrace_symbols() on macOS is a heck of a lot better than on glibc
platforms. CI for Windows with Visual Studio doesn't have libbacktrace
installed yet (and has the aforementioned signal safety issue); I think it's
installed for Windows w/ mingw.

> It would also be great if the tests spit out information about
> assertion failures that was reasonably complete (backtrace without any
> mangling, query text included, other basic context), reliably and
> uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> SQL scripts.

Hm. What other basic context are you thinking of? Pid is obvious. I guess
backend type could be useful too, but it can normally be inferred from the
stack trace pretty easily. Application name could be useful to know which
test caused the crash.

I'm a bit wary about trying to print things like query text - what if that
string points to memory not terminated by a \0? I guess we could use an
approach similar to pgstat_get_crashed_backend_activity().

One issue with reporting crashes from signal handlers is that the obvious way
to make that signal safe (lots of small writes) leads to the potential for
interspersed lines. It's probably worth having a statically sized buffer that
will commonly be large enough to print a whole backtrace. When too small, it
should include the pid at the start of every "chunk".

> Which kind of test happened to be involved is usually not interesting to me
> here (even the query text won't usually be interesting), so being forced to
> think about it slows me down quite a lot.

Interesting - I quite often end up spending time trying to dig out which
query from what sql file triggered a crash, so I can try to trigger it in
isolation. I often wished the server knew the source line associated with the
query. Enough that I pondered ways to have psql transport that knowledge to
the server.

Greetings,

Andres Freund
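For readers following along, here is a self-contained sketch of the general mechanism being discussed: install a SIGSEGV handler, emit a one-line header with a single write(), then print a backtrace. It uses glibc's execinfo.h rather than libbacktrace; the branch linked above is far more careful about async-signal safety and portability, so this only shows the shape of the approach.

```c
/*
 * Sketch only (glibc/Linux): compile with "cc -g -rdynamic sketch.c" so that
 * backtrace_symbols_fd() can resolve function names.  Not the patch above.
 */
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
crash_handler(int signo, siginfo_t *info, void *ucontext)
{
	void	   *frames[64];
	int			nframes;
	char		buf[128];
	int			len;

	(void) ucontext;

	/*
	 * One buffer, one write(): avoids interleaving the header with output
	 * from other crashing processes.  Note that snprintf() is not formally
	 * async-signal-safe; a real implementation would format this by hand.
	 */
	len = snprintf(buf, sizeof(buf),
				   "pid %d received signal %d, si_code: %d, si_addr: %p\n",
				   (int) getpid(), signo, info->si_code, info->si_addr);
	if (len > (int) sizeof(buf) - 1)
		len = (int) sizeof(buf) - 1;
	if (len > 0)
		(void) write(STDERR_FILENO, buf, (size_t) len);

	/* backtrace_symbols_fd() writes straight to the fd, no malloc involved */
	nframes = backtrace(frames, 64);
	backtrace_symbols_fd(frames, nframes, STDERR_FILENO);

	/* restore default disposition and re-raise so a core file is still produced */
	signal(signo, SIG_DFL);
	raise(signo);
}

int
main(void)
{
	struct sigaction sa;
	void	   *preload[1];

	/* call backtrace() once up front so libgcc is loaded outside the handler */
	backtrace(preload, 1);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = crash_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);

	*(volatile int *) 0xdeadbeef = 1;	/* force a crash, as in the example above */
	return 0;
}
```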
On Fri, Aug 11, 2023 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> Where supported, a crash (distinguishing from assertion failures) will now
> report something like:
>
> process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
> [0x5628ec45212f] pg_fatalsig_handler+0x20f: ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
> [0x7fc4b743650f] [unknown]
> [0x5628ec14897c] check_root+0x19c (inlined): ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
> [0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183
>
> after I added
>    *(volatile int*)0xdeadbeef = 1;
> to check_root().

It'll be like living in the future!

> For signals sent by users, it'd show the pid of the process sending the signal
> on most OSs. I really would like some generalized infrastructure for that, so
> that we can report for things like query cancellations.

That sounds great.

> > It would also be great if the tests spit out information about
> > assertion failures that was reasonably complete (backtrace without any
> > mangling, query text included, other basic context), reliably and
> > uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> > SQL scripts.
>
> Hm. What other basic context are you thinking of? Pid is obvious. I guess
> backend type could be useful too, but normally be inferred from the stack
> trace pretty easily. Application name could be useful to know which test
> caused the crash.
>
> I'm a bit wary about trying to print things like query text - what if that
> string points to memory not terminated by a \0? I guess we could use an
> approach similar to pgstat_get_crashed_backend_activity().

I agree that being less verbose by default is good. On second thought even
query text isn't all that important.

> One issue with reporting crashes from signal handlers is that the obvious way
> to make that signal safe (lots of small writes) leads to the potential for
> interspersed lines. It's probably worth having a statically sized buffer that
> will commonly be large enough to print a whole backtrace. When too small, it
> should include the pid at the start of every "chunk".

Good idea.

> > Which kind of test happened to be involved is usually not interesting to me
> > here (even the query text won't usually be interesting), so being forced to
> > think about it slows me down quite a lot.
>
> Interesting - I quite often end up spending time trying to dig out which query
> from what sql file triggered a crash, so I can try to trigger it in
> isolation. I often wished the server knew the source line associated with the
> query. Enough that I pondered ways to have psql transport that knowledge to the
> the server.

I actually do plenty of that too. My overall point was this: there is likely
some kind of Pareto principle here. That should guide the sorts of scenarios
we optimize for.

If you actually benchmarked where I spent time while writing code, minute to
minute, I bet it would show that most of the individual debug-compile cycles
were triggered by issues that had a fairly simple and immediate cause. Cases
where I improve one small thing, and then rerun the tests, which show an
assertion failure in nearby code. As soon as I see very basic details I
immediately think "duh, of course" in these cases, at which point I'll come
up with a likely-good fix in seconds. And then I'll rinse and repeat.
My fix might just work (at least to the extent that all tests now pass), but
it also might lead to another assertion failure of the same general nature.

There are also lots of cases where I really do have to think about recreating
the details from the test in order to truly understand what's going on, of
course. But there are still way, way more individual "duh, of course"
assertion failures in practice. Those are where productivity wins are still
possible, because the bottleneck isn't just that I have an incomplete mental
model that I need to work to expand.

Perhaps my experiences here aren't universal. But it seems like they might be
roughly the same as everybody else's who works on Postgres? Assuming that
they are, then the information that is output should be optimized for the
"duh, of course" scenarios. Not to an absurd degree, mind you. But the output
shouldn't be too verbose.

Ideally there'd be a still fairly straightforward way of getting extra
information, for the cases where debugging is likely to take a few minutes
and require real focus. The extra work in those other cases is relatively
insignificant, because the "startup costs" are relatively large -- a little
extra indirection (though only a little) can't hurt too much.

--
Peter Geoghegan
On Fri, Aug 11, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> > Attached patch combines Assert and elog(ERROR, ) so that when an
> > Assert is triggered in assert-enabled binary, it will throw an error
> > while keeping the backend intact. Thus it does not affect gdb session
> > or psql session. These elogs do not make their way to non-assert
> > binary so do not make it to production like Assert.
>
> I am quite strongly against this. This will lead to assertions being hit in
> tests without that being noticed, e.g. because they happen in a background
> process that just restarts.

Fair point. Our regression tests don't check the server error logs for
unwanted errors.

How about restricting it to only client backends? I don't know how to
identify those from other processes, but there must be a way.

--
Best Wishes,
Ashutosh Bapat
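One possible answer, sketched as a variation of the hypothetical AssertLog() shown earlier in the thread: each process records its type in MyBackendType (declared in miscadmin.h), so the macro could keep the hard abort everywhere except regular client backends. Again purely illustrative, not the patch.

```c
/*
 * Sketch: recover with an ERROR only in regular client backends, where a
 * human (and their psql/gdb session) is watching; background processes keep
 * the hard failure so a tripped assertion there cannot go unnoticed.
 * Assumes postgres.h and miscadmin.h, as in any backend .c file.
 */
#ifdef USE_ASSERT_CHECKING
#define AssertLog(condition) \
	do { \
		if (!(condition)) \
		{ \
			if (MyBackendType == B_BACKEND) \
				elog(ERROR, "failed Assert(\"%s\"), File: \"%s\", Line: %d", \
					 #condition, __FILE__, __LINE__); \
			else \
				abort();	/* or ExceptionalCondition(), as Assert() does */ \
		} \
	} while (0)
#else
#define AssertLog(condition)	((void) true)
#endif
```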
On Sat, Aug 12, 2023 at 12:56 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> > On Fri, Aug 11, 2023 at 11:23 AM Andres Freund <andres@anarazel.de> wrote:
> > > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > > They are essentially a form of assertion that isn't limited to
> > > > assert-enabled builds.
> > >
> > > Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> > > our transaction state machinery is confused. Yes, let's just continue going
> > > through the same machinery again, that'll resolve it.".
> >
> > I am not unsympathetic to Ashutosh's point about conventional ERRORs
> > being easier to deal with when debugging your own code, during initial
> > development work.
>
> Oh, I am as well - I just don't think it's a good idea to introduce "log + error"
> assertions to core postgres, because it seems very likely that they'll end up
> getting used a lot.

I am open to ideas which allow the same backend to recover after meeting an
easily recoverable but "can't happen" condition, rather than losing that
backend and starting all over with a new backend.

Not all Assert'ed conditions are recoverable, so a blanket GUC or compile
time option won't help. Those might make things worse. We need two separate
incantations for non-recoverable and recoverable Asserts respectively.

I like Peter's idea of having a new elevel; however, it still requires adding
a conditional USE_ASSERT block, an if testing the condition and then writing
an error message. AssertLog() in the patch uses just a few more letters.

It won't help to expand the scope of the problem since that will reduce the
chances of getting anything done.

--
Best Wishes,
Ashutosh Bapat
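In code, the verbosity comparison being drawn here looks roughly like the following fragment. Both forms are hypothetical: ASSERT_ERROR stands in for the imagined new elevel, and AssertLog() for the macro in the attached patch.

```c
	/*
	 * New-elevel style, as characterized above: each call site needs its own
	 * guard, test and message to stay out of production builds.
	 */
#ifdef USE_ASSERT_CHECKING
	if (!(nappinfos >= 1 && appinfos != NULL))
		elog(ASSERT_ERROR, "adjust_appendrel_attrs called with nothing to adjust");
#endif

	/* AssertLog style: the stringified condition doubles as the message. */
	AssertLog(nappinfos >= 1 && appinfos != NULL);
```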