Thread: Should we make scary sounding, but actually routine, errors less scary?
Hi,

There are a few errors that we issue that are often much less bad than they sound. The most common cases I can immediately recall are:

1) Mentioning a crash, once for each backend, when shutting down immediately. Currently the log output for that, with just two sessions connected, is the following:

2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [23199][4/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION: quickdie, postgres.c:2734
2019-11-05 15:09:52.634 PST [23187][3/0] WARNING: 57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [23187][3/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION: quickdie, postgres.c:2734
2019-11-05 15:09:52.634 PST [9345][1/0] WARNING: 57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [9345][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION: quickdie, postgres.c:2734
2019-11-05 15:09:52.644 PST [9340][] LOG: 00000: database system is shut down
2019-11-05 15:09:52.644 PST [9340][] LOCATION: UnlinkLockFiles, miscinit.c:859

(23199 and 23187 are backends, 9345 is the autovacuum launcher)

I think there are multiple things wrong with this. For one, reading that the server has crashed (no "might" in there) is scary when all that's happening is an intentional shutdown. But also the sheer log volume is bad - on a busy server there may be a *lot* of these lines.

I think the log volume is bad even for an actual PANIC. I've spent *way* too much time scrolling through pages and pages of the above lines, just to find the one or two lines indicating an actual error.

It seems like we ought to be able to somehow

a) Signal that the server has been shut down in immediate mode, rather than actually crashed, and issue a different log message to the user.

b) Stop issuing, to the server log, the same message over and over. We instead just ought to send the message to the client. We do, however, need to be careful that we don't make it harder to debug a SIGQUIT sent directly to backend processes.
2) At the end of crash recovery, and often when the startup process switches between WAL sources, we get scary messages like:

2019-11-05 15:20:21.907 PST [23407][] <> <> LOG: 00000: invalid record length at F/48C0A500: wanted 24, got 0
2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION: ReadRecord, xlog.c:4282

or

2019-11-05 15:35:03.321 PST [28518][1/0] LOG: invalid resource manager ID 52 at 3/7CD0B1B8

or any of the other types of xlogreader errors.

I've seen countless tickets being raised because PG users looked at their logs and got scared. We want them to look at the logs, so this seems counter-productive.

It seems we, at the very least, could add an error context or something explaining that a LOG message about the end of the WAL is to be expected. Or perhaps we should reformulate the message to something like

'ELEVEL: reached end of valid WAL at XX/XX'
'DETAIL: end determined due to invalid record length: wanted 24, got 0'

perhaps with a HINT in the elevel < ERROR case indicating that this is not a failure.

3) When a standby node is shut down in immediate mode, we issue:

2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

Which imo doesn't really make sense, at least for standbys, which are expected to be in recovery forever. There are more than enough scenarios where one would shut down a standby in immediate mode for good reasons; we shouldn't issue such warnings in that case.

The tricky part in improving this would be how to detect when the message should still be issued for a standby. One idea, which is not bulletproof but might be good enough, would be to record in the control file which position recovery was started from last time, and only issue the error when recovery would start from the same point.

I'm sure there are more types of messages in this category; these are just the ones I could immediately recall from memory as having scared actual users unnecessarily.

I don't plan on fixing these immediately myself, even if we were to agree on something, so if anybody is interested in helping...

Greetings,

Andres Freund
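As a rough illustration of idea (a), here is a minimal sketch (not a patch) of how the backend's quickdie() message could be chosen based on why the postmaster sent SIGQUIT. The QuitReason enum and GetQuitReason() helper are invented names; the assumption is that the postmaster records its reason somewhere backends can read, for example a flag in shared memory, before signalling its children.

/*
 * Sketch only: choose the quickdie() message based on why the postmaster
 * sent SIGQUIT.  QuitReason and GetQuitReason() are hypothetical; the
 * assumption is that the postmaster records the reason (for example in
 * shared memory) before signalling its children.
 */
#include "postgres.h"

typedef enum QuitReason
{
	QUIT_REASON_CRASH,				/* another server process crashed */
	QUIT_REASON_IMMEDIATE_SHUTDOWN	/* administrator requested immediate shutdown */
} QuitReason;

extern QuitReason GetQuitReason(void);	/* hypothetical: reads postmaster-set state */

static void
report_termination_reason(void)
{
	if (GetQuitReason() == QUIT_REASON_IMMEDIATE_SHUTDOWN)
		ereport(WARNING,
				(errcode(ERRCODE_ADMIN_SHUTDOWN),
				 errmsg("terminating connection due to immediate shutdown request")));
	else
		ereport(WARNING,
				(errcode(ERRCODE_CRASH_SHUTDOWN),
				 errmsg("terminating connection because of crash of another server process"),
				 errdetail("The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory."),
				 errhint("In a moment you should be able to reconnect to the database and repeat your command.")));
}

Idea (b), cutting the per-backend repetition in the server log, would be a separate change (for example routing this WARNING to the client only), and it would still have to keep enough logging to diagnose a SIGQUIT sent directly to a backend.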
On 11/05/19 18:54, Andres Freund wrote:
> Hi,
>
> There are a few errors that we issue that are often much less bad than they sound. The most common cases I can immediately recall are:
>
> 1) Mentioning a crash, once for each backend, when shutting down immediately. Currently the log output for that, with just two sessions connected, is the following:

While on the topic ... this may be more a property of particular packagings of the server, to run under systemd etc., but often there is a process during startup trying periodically to open a connection to the server to confirm that it has successfully started, and the result is a dozen or so log messages that say "FATAL: the database system is starting up" ... which is amusing once you get what it's doing, but a bit disconcerting until then.

Not sure how that could be changed ... maybe a connection-time option trial_connection that would suppress the fatal ereport on rejecting the connection?

Regards,
-Chap
Re: Should we make scary sounding, but actually routine, errors less scary?
From: Kyotaro Horiguchi
At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> There are a few errors that we issue that are often much less bad than they sound. The most common cases I can immediately recall are:
>
> 1) Mentioning a crash, once for each backend, when shutting down immediately. Currently the log output for that, with just two sessions connected, is the following:
>
> 2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
> 2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
> 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23199][4/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23187][3/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [9345][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.644 PST [9340][] LOG: 00000: database system is shut down
> 2019-11-05 15:09:52.644 PST [9340][] LOCATION: UnlinkLockFiles, miscinit.c:859
>
> (23199 and 23187 are backends, 9345 is the autovacuum launcher)
>
> I think there are multiple things wrong with this. For one, reading that the server has crashed (no "might" in there) is scary when all that's happening is an intentional shutdown. But also the sheer log volume is bad - on a busy server there may be a *lot* of these lines.
>
> I think the log volume is bad even for an actual PANIC. I've spent *way* too much time scrolling through pages and pages of the above lines, just to find the one or two lines indicating an actual error.
>
> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather than actually crashed, and issue a different log message to the user.
>
> b) Stop issuing, to the server log, the same message over and over. We instead just ought to send the message to the client. We do, however, need to be careful that we don't make it harder to debug a SIGQUIT sent directly to backend processes.

I doubt that different messages for the server log and the client are worth doing.
Isn't it enough to move the cause description from the backend message to the postmaster's? In addition to that, I don't see such a message when connecting with psql. It just reports "server closed the connection unexpectedly" without a server message. If I'm not missing something, the HINT message is useless.

  2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
  2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
+ 2019-11-05 15:09:52.634 PST [9340][] LOG: terminating all active server processes
- 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
+ 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection due to command by postmaster
- 2019-11-05 15:09:52.634 PST [23199][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
- 2019-11-05 15:09:52.634 PST [23199][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.

If some process has crashed, it looks like:

2019-11-06 10:47:11.387 JST [19737] LOG: 00000: server process (PID 19774) was terminated by signal 11: Segmentation fault
2019-11-06 10:47:11.387 JST [19737] DETAIL: Failed process was running: select pg_sleep(1200);
2019-11-06 10:47:11.387 JST [19737] LOCATION: LogChildExit, postmaster.c:3749
2019-11-06 10:47:11.387 JST [19737] LOG: 00000: terminating any other active server processes, because a server process exited abnormally and possibly corrupted shared memory.
2019-11-06 10:47:11.387 JST [19737] LOCATION: HandleChildCrash, postmaster.c:3469
2019-11-06 10:47:11.387 JST [19800] WARNING: 57P02: terminating connection due to command by postmaster
2019-11-06 10:47:11.387 JST [19800] LOCATION: quickdie, postgres.c:2736

> 2) At the end of crash recovery, and often when the startup process switches between WAL sources, we get scary messages like:
>
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOG: 00000: invalid record length at F/48C0A500: wanted 24, got 0
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION: ReadRecord, xlog.c:4282
> or
> 2019-11-05 15:35:03.321 PST [28518][1/0] LOG: invalid resource manager ID 52 at 3/7CD0B1B8
>
> or any of the other types of xlogreader errors.
>
> I've seen countless tickets being raised because PG users looked at their logs and got scared. We want them to look at the logs, so this seems counter-productive.
>
> It seems we, at the very least, could add an error context or something explaining that a LOG message about the end of the WAL is to be expected. Or perhaps we should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is not a failure.

The proposed message seems far less scary. +1.

> 3) When a standby node is shut down in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>
> Which imo doesn't really make sense, at least for standbys, which are expected to be in recovery forever.
> There are more than enough scenarios where one would shut down a standby in immediate mode for good reasons; we shouldn't issue such warnings in that case.
>
> The tricky part in improving this would be how to detect when the message should still be issued for a standby. One idea, which is not bulletproof but might be good enough, would be to record in the control file which position recovery was started from last time, and only issue the error when recovery would start from the same point.

Recovery always starts from the latest REDO point. Maybe MinRecoveryPoint is better suited for this use.

> I'm sure there are more types of messages in this category; these are just the ones I could immediately recall from memory as having scared actual users unnecessarily.

I often see inquiries about "FATAL: the database system is starting up". It is actually FATAL for backends internally, but it is also overly scary for users.

> I don't plan on fixing these immediately myself, even if we were to agree on something, so if anybody is interested in helping...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
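A minimal sketch of what the reformulated end-of-WAL message from (2), endorsed above, could look like: report reaching the end of valid WAL at the chosen elevel, demote the xlogreader's explanation to DETAIL, and add a HINT in the non-error case. The function name, its placement relative to ReadRecord() and the xlogreader, and the exact wording are assumptions made for illustration, not an actual patch.

/*
 * Sketch only: report reaching the end of valid WAL, demoting the
 * xlogreader's explanation to DETAIL and adding a HINT when this is not
 * being raised as an error.  Where this would be called from is left open.
 */
#include "postgres.h"
#include "access/xlogdefs.h"

static void
report_end_of_wal(int emode, XLogRecPtr endptr, const char *reader_errormsg)
{
	ereport(emode,
			(errmsg("reached end of valid WAL at %X/%X",
					(uint32) (endptr >> 32), (uint32) endptr),
			 errdetail("End of WAL was determined by: %s.", reader_errormsg),
			 (emode < ERROR) ?
			 errhint("This is expected at the end of recovery and is not a failure.") : 0));
}

Called with the existing xlogreader error string, e.g. "invalid record length at F/48C0A500: wanted 24, got 0", this would produce output very close to the wording proposed upthread.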
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> There are a few errors that we issue that are often much less bad than they sound. The most common cases I can immediately recall are:

Agreed.

> 1) Mentioning a crash, once for each backend, when shutting down immediately. Currently the log output for that, with just two sessions connected, is the following:

Eh, +1 or +2.

> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather than actually crashed, and issue a different log message to the user.

That'd be nice, but I anticipate the question coming up of "how much is too much to do before an immediate shutdown?"

> b) Stop issuing, to the server log, the same message over and over. We instead just ought to send the message to the client. We do, however, need to be careful that we don't make it harder to debug a SIGQUIT sent directly to backend processes.

Would be nice if we could improve that, agreed.

> 2) At the end of crash recovery, and often when the startup process switches between WAL sources, we get scary messages like:

+1000

> I've seen countless tickets being raised because PG users looked at their logs and got scared. We want them to look at the logs, so this seems counter-productive.

Yes, agreed, this happens all the time and would be good to improve.

> It seems we, at the very least, could add an error context or something explaining that a LOG message about the end of the WAL is to be expected. Or perhaps we should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is not a failure.

Something like that does look like an improvement.

> 3) When a standby node is shut down in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>
> Which imo doesn't really make sense, at least for standbys, which are expected to be in recovery forever. There are more than enough scenarios where one would shut down a standby in immediate mode for good reasons; we shouldn't issue such warnings in that case.
>
> The tricky part in improving this would be how to detect when the message should still be issued for a standby. One idea, which is not bulletproof but might be good enough, would be to record in the control file which position recovery was started from last time, and only issue the error when recovery would start from the same point.

Yeah... This sounds like it would be more difficult to tackle, though I agree it'd be nice to improve on this too.

Thanks,

Stephen
Hi,

On 2019-11-05 22:00:58 -0500, Chapman Flack wrote:
> On 11/05/19 18:54, Andres Freund wrote:
> > There are a few errors that we issue that are often much less bad than they sound. The most common cases I can immediately recall are:
> >
> > 1) Mentioning a crash, once for each backend, when shutting down immediately. Currently the log output for that, with just two sessions connected, is the following:
>
> While on the topic ... this may be more a property of particular packagings of the server, to run under systemd etc., but often there is a process during startup trying periodically to open a connection to the server to confirm that it has successfully started, and the result is a dozen or so log messages that say "FATAL: the database system is starting up" ... which is amusing once you get what it's doing, but a bit disconcerting until then.

I think that is best solved by using pg_ctl's logic of looking at the postmaster state file, rather than connecting before the server is ready. For one thing, connecting requires actually being able to connect, which isn't always a given. If using pg_ctl is problematic for some reason, it'd imo be better to extract the relevant logic into its own tool.

> Not sure how that could be changed ... maybe a connection-time option trial_connection that would suppress the fatal ereport on rejecting the connection?

I think that'd be a recipe for hard-to-debug issues. Imagine somebody DOSing the server and setting that option - you'd have no way to actually see what's happening.

Greetings,

Andres Freund
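For illustration, here is a small standalone C sketch of what checking the postmaster state file could look like, instead of probing the server with trial connections. It assumes that recent postmaster.pid files end with a status line such as "starting" or "ready"; the exact strings, the line's position, and the path used in main() are assumptions to verify against the server version at hand, and pg_ctl remains the supported interface for this.

/*
 * Sketch only, not a supported interface.  Assumes the last non-empty line
 * of postmaster.pid is a status word such as "starting" or "ready"; check
 * this against the PostgreSQL version in use before relying on it.
 */
#include <stdio.h>
#include <string.h>

/* Return 1 if the given postmaster.pid reports a server ready for connections. */
static int
postmaster_looks_ready(const char *pidfile_path)
{
	FILE	   *fp = fopen(pidfile_path, "r");
	char		line[256];
	char		last[256] = "";

	if (fp == NULL)
		return 0;				/* no pidfile: server not running, or not readable by us */

	/* remember the last non-empty line, assumed to be the status line */
	while (fgets(line, sizeof(line), fp) != NULL)
	{
		if (line[0] != '\n' && line[0] != '\0')
			strncpy(last, line, sizeof(last) - 1);
	}
	fclose(fp);

	return strncmp(last, "ready", 5) == 0 ||
		strncmp(last, "standby", 7) == 0;
}

int
main(void)
{
	/* the data directory path here is just a placeholder */
	printf("ready: %d\n",
		   postmaster_looks_ready("/var/lib/postgresql/data/postmaster.pid"));
	return 0;
}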
Hi,

On 2019-11-06 17:36:09 +0900, Kyotaro Horiguchi wrote:
> At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in
> > Hi,
> >
> > There are a few errors that we issue that are often much less bad than they sound. The most common cases I can immediately recall are:
> >
> > 1) Mentioning a crash, once for each backend, when shutting down immediately. Currently the log output for that, with just two sessions connected, is the following:
> >
> > 2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
> > 2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
> > 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [23199][4/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION: quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING: 57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [23187][3/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION: quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING: 57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [9345][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION: quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.644 PST [9340][] LOG: 00000: database system is shut down
> > 2019-11-05 15:09:52.644 PST [9340][] LOCATION: UnlinkLockFiles, miscinit.c:859
> >
> > (23199 and 23187 are backends, 9345 is the autovacuum launcher)
> >
> > I think there are multiple things wrong with this. For one, reading that the server has crashed (no "might" in there) is scary when all that's happening is an intentional shutdown. But also the sheer log volume is bad - on a busy server there may be a *lot* of these lines.
> >
> > I think the log volume is bad even for an actual PANIC. I've spent *way* too much time scrolling through pages and pages of the above lines, just to find the one or two lines indicating an actual error.
> >
> > It seems like we ought to be able to somehow
> >
> > a) Signal that the server has been shut down in immediate mode, rather than actually crashed, and issue a different log message to the user.
> >
> > b) Stop issuing, to the server log, the same message over and over. We instead just ought to send the message to the client.
> > We do, however, need to be careful that we don't make it harder to debug a SIGQUIT sent directly to backend processes.
>
> I doubt that different messages for the server log and the client are worth doing. Isn't it enough to move the cause description from the backend message to the postmaster's?

I'm not quite following what you're suggesting?

> In addition to that, I don't see such a message when connecting with psql. It just reports "server closed the connection unexpectedly" without a server message. If I'm not missing something, the HINT message is useless.

It depends on the state of the connection, IIRC, whether you'll get the error message or not. There are some cases where we might get notified about the connection having been closed without actually reading the error message. There were some libpq fixes around this not too long ago - is it possible that you're running a version of psql linked against an older libpq version (e.g. from the OS)?

I'm not a fan of the above error message libpq generates - it logs this in plenty of cases where there was a network error. Talking about server crashes when simple network issues may be the reason imo is not a great idea.

> > 3) When a standby node is shut down in immediate mode, we issue:
> >
> > 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> > 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> >
> > Which imo doesn't really make sense, at least for standbys, which are expected to be in recovery forever. There are more than enough scenarios where one would shut down a standby in immediate mode for good reasons; we shouldn't issue such warnings in that case.
> >
> > The tricky part in improving this would be how to detect when the message should still be issued for a standby. One idea, which is not bulletproof but might be good enough, would be to record in the control file which position recovery was started from last time, and only issue the error when recovery would start from the same point.
>
> Recovery always starts from the latest REDO point.

Well, not quite always, e.g. when restoring from a base backup. But otherwise, yea. But I'm not quite sure what your point is? We store the last checkpoint/redo position in the control file, and update it at every checkpoint/restartpoint - so the last REDO pointer stored there would not necessarily be the same as the redo pointer we started up from last time? Am I misunderstanding?

> Maybe MinRecoveryPoint is better suited for this use.

What MinRecoveryPoint gets set to is not deterministic - it depends e.g. on which buffers get written out at what time. I don't quite see how we could make reliable use of it.

> > I'm sure there are more types of messages in this category; these are just the ones I could immediately recall from memory as having scared actual users unnecessarily.
>
> I often see inquiries about "FATAL: the database system is starting up". It is actually FATAL for backends internally, but it is also overly scary for users.

We could probably just reformulate that error message and help users, without any larger changes...

Greetings,

Andres Freund
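A hypothetical sketch of the control-file idea from (3) discussed above: remember where recovery last started, and only emit the "database system was interrupted while in recovery" warning when a new recovery attempt would start from the very same point. The struct, field name, and check below are invented for illustration; a real change would extend ControlFileData itself, bump PG_CONTROL_VERSION, and decide where during startup to update and consult the value.

/*
 * Sketch only.  The field name and the check are invented; a real change
 * would also need a PG_CONTROL_VERSION bump, pg_controldata output, etc.
 */
#include "postgres.h"
#include "access/xlogdefs.h"

/* Imagined addition to ControlFileData in catalog/pg_control.h: */
typedef struct ControlFileDataAddition
{
	XLogRecPtr	lastRecoveryStartPoint;	/* where recovery last started, or 0 */
} ControlFileDataAddition;

/*
 * Decide whether the "database system was interrupted while in recovery"
 * warning is warranted: only if we are about to start recovery from the
 * same point as last time, i.e. the previous attempt apparently made no
 * progress, which is the situation the corruption HINT is really about.
 */
static bool
should_warn_about_interrupted_recovery(XLogRecPtr recoveryStartPoint,
									   XLogRecPtr lastRecoveryStartPoint)
{
	return lastRecoveryStartPoint != InvalidXLogRecPtr &&
		recoveryStartPoint == lastRecoveryStartPoint;
}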