Re: Should we make scary sounding, but actually routine, errors less scary? - Mailing list pgsql-hackers
From | Kyotaro Horiguchi
---|---
Subject | Re: Should we make scary sounding, but actually routine, errors less scary?
Date |
Msg-id | 20191106.173609.355066219628074584.horikyota.ntt@gmail.com
In response to | Should we make scary sounding, but actually routine, errors less scary? (Andres Freund <andres@anarazel.de>)
Responses | Re: Should we make scary sounding, but actually routine, errors less scary?
List | pgsql-hackers
At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:
>
>
> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:
>
> 2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
> 2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
> 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23199][4/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23187][3/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [9345][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.644 PST [9340][] LOG: 00000: database system is shut down
> 2019-11-05 15:09:52.644 PST [9340][] LOCATION: UnlinkLockFiles, miscinit.c:859
>
> (23199, 23187 are backends, 9345 is the autovacuum launcher)
>
> I think there's multiple things wrong with this. For one, reading that
> the server has (no might in there) crashed, is scary, when all that's
> happening is an intentional shutdown. But also the sheer log volume is
> bad - on a busy server there may be a *lot* of these lines.
>
> I think the log volume is bad even for an actual PANIC. I've spent *way*
> too much time scrolling through pages and pages of the above lines, just
> to find the one or two lines indicating an actual error.
>
> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.
>
> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

I doubt that different messages for the server log and the client are worth
doing. Isn't it enough to move the cause description from the backend message
to the postmaster's message? In addition to that, I don't see such a message
when connecting with psql; it just reports "server closed the connection
unexpectedly" without a server message. If I'm not missing something, the
HINT message is useless.

  2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
  2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
+ 2019-11-05 15:09:52.634 PST [9340][] LOG: terminating all active server processes
- 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
+ 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection due to command by postmaster
- 2019-11-05 15:09:52.634 PST [23199][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
- 2019-11-05 15:09:52.634 PST [23199][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.

If some process has crashed, it looks like:

2019-11-06 10:47:11.387 JST [19737] LOG: 00000: server process (PID 19774) was terminated by signal 11: Segmentation fault
2019-11-06 10:47:11.387 JST [19737] DETAIL: Failed process was running: select pg_sleep(1200);
2019-11-06 10:47:11.387 JST [19737] LOCATION: LogChildExit, postmaster.c:3749
2019-11-06 10:47:11.387 JST [19737] LOG: 00000: terminating any other active server processes, because a server process exited abnormally and possibly corrupted shared memory.
2019-11-06 10:47:11.387 JST [19737] LOCATION: HandleChildCrash, postmaster.c:3469
2019-11-06 10:47:11.387 JST [19800] WARNING: 57P02: terminating connection due to command by postmaster
2019-11-06 10:47:11.387 JST [19800] LOCATION: quickdie, postgres.c:2736
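To make that concrete, a minimal sketch of where the two messages would come
from (an illustration only, not a patch; the helper function names are
invented, while ereport()/errmsg()/errcode() and the 57P02 SQLSTATE are
existing things):

/*
 * Hypothetical sketch of the proposal above.  The postmaster logs the cause
 * once; every backend then emits only a short, neutral line.
 */

/* postmaster.c, in pmdie()'s immediate-shutdown branch: log the cause once */
static void
log_immediate_shutdown(void)
{
    ereport(LOG,
            (errmsg("received immediate shutdown request")));
    /* proposed additional line, so the cause is visible in one place */
    ereport(LOG,
            (errmsg("terminating all active server processes")));
}

/* postgres.c, in quickdie(): each backend emits only this short message */
static void
log_quickdie(void)
{
    ereport(WARNING,
            (errcode(ERRCODE_CRASH_SHUTDOWN),   /* keeps today's 57P02 SQLSTATE */
             errmsg("terminating connection due to command by postmaster")));
}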
> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:
>
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOG: 00000: invalid record length at F/48C0A500: wanted 24, got 0
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION: ReadRecord, xlog.c:4282
> or
> 2019-11-05 15:35:03.321 PST [28518][1/0] LOG: invalid resource manager ID 52 at 3/7CD0B1B8
>
> or any of the other types of xlogreader errors.
>
> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.
>
> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

The proposed message seems far less scary. +1.
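Roughly, at the place in ReadRecord() where XLogReadRecord() has returned
NULL with an explanation string, the reformulated report could look like the
following. This is only a sketch of the wording; the variable names (RecPtr,
errormsg, emode) are assumed from the surrounding caller and the HINT text is
made up:

/*
 * Hypothetical sketch, not a patch: emit the xlogreader's explanation as
 * DETAIL under a neutral "end of valid WAL" message, with a HINT only when
 * the caller does not treat this as an error.
 */
if (errormsg != NULL)
{
    if (emode < ERROR)
        ereport(emode,
                (errmsg("reached end of valid WAL at %X/%X",
                        (uint32) (RecPtr >> 32), (uint32) RecPtr),
                 errdetail_internal("%s", errormsg),
                 errhint("This is expected at the end of recovery or of the available WAL;"
                         " it does not indicate data corruption.")));
    else
        ereport(emode,
                (errmsg("reached end of valid WAL at %X/%X",
                        (uint32) (RecPtr >> 32), (uint32) RecPtr),
                 errdetail_internal("%s", errormsg)));
}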
> 3) When a standby node is shutdown in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
>
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Recovery always starts from the latest REDO point. Maybe MinRecoveryPoint is
better for that use.
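As a rough sketch of that idea (not a worked-out patch): if pg_control grew a
field remembering how far the previous recovery attempt got --
lastMinRecoveryPoint below is an invented name -- StartupXLOG() could emit
the scary HINT only when that position has not advanced since last time:

/*
 * Hypothetical sketch only; meant to live near the existing
 * DB_IN_ARCHIVE_RECOVERY message in StartupXLOG().  Writing the new field
 * back out to pg_control is omitted here.
 */
if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
{
    if (ControlFile->minRecoveryPoint == ControlFile->lastMinRecoveryPoint)
        ereport(LOG,
                (errmsg("database system was interrupted while in recovery at log time %s",
                        str_time(ControlFile->checkPointCopy.time)),
                 errhint("If this has occurred more than once some data might be corrupted"
                         " and you might need to choose an earlier recovery target.")));
    else
        ereport(LOG,
                (errmsg("database system was interrupted while in recovery at log time %s",
                        str_time(ControlFile->checkPointCopy.time))));

    /* remember how far this attempt starts from, for the next startup */
    ControlFile->lastMinRecoveryPoint = ControlFile->minRecoveryPoint;
}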
> I'm sure there are more types of messages in this category, these are
> just the ones I could immediately recall from memory as having scared
> actual users unnecessarily.

I often see inquiries about "FATAL: the database system is starting up". It
is actually FATAL for backends internally, but it is also overly scary for
users.

> I don't plan on fixing these immediately myself, even if we were to
> agree on something, so if anybody is interested in helping...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center