Re: Should we make scary sounding, but actually routine, errors less scary? - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Should we make scary sounding, but actually routine, errors less scary?
Date
Msg-id 20191106.173609.355066219628074584.horikyota.ntt@gmail.com
In response to Should we make scary sounding, but actually routine, errors less scary?  (Andres Freund <andres@anarazel.de>)
Responses Re: Should we make scary sounding, but actually routine, errors less scary?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:
> 
> 
> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:
> 
> 2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
> 2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
> 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23199][4/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION:  quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING:  57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23187][3/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION:  quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING:  57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [9345][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION:  quickdie, postgres.c:2734
> 2019-11-05 15:09:52.644 PST [9340][] LOG:  00000: database system is shut down
> 2019-11-05 15:09:52.644 PST [9340][] LOCATION:  UnlinkLockFiles, miscinit.c:859
> 
> (23199, 23187 are backends, 9345 is the autovacuum launcher)
> 
> I think there's multiple things wrong with this. For one, reading that
> the server has (no might in there) crashed, is scary, when all that's
> happening is an intentional shutdown.  But also the sheer log volume is
> bad - on a busy server there may be a *lot* of these lines.
> 
> I think the log volume is bad even for an actual PANIC. I've spent *way*
> too much time scrolling through pages and pages of the above lines, just
> to find the one or two lines indicating an actual error.
>
> It seems like we ought to be able to somehow
> 
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.
> 
> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

I doubt that emitting different messages to the server log and to the
client is worth doing. Isn't it enough to move the description of the
cause from the backend's message to the postmaster's?

In addition to that, I don't see such a message on the connecting psql
side. It just reports "server closed the connection unexpectedly"
without any server message. Unless I'm missing something, the HINT
message is useless.


  2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
  2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
+ 2019-11-05 15:09:52.634 PST [9340][] LOG:  terminating all active server processes
- 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
+ 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection due to command by postmaster
- 2019-11-05 15:09:52.634 PST [23199][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
- 2019-11-05 15:09:52.634 PST [23199][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

If some process has crashed, it looks like:

2019-11-06 10:47:11.387 JST [19737] LOG:  00000: server process (PID 19774) was terminated by signal 11: Segmentation fault
2019-11-06 10:47:11.387 JST [19737] DETAIL:  Failed process was running: select pg_sleep(1200);
2019-11-06 10:47:11.387 JST [19737] LOCATION:  LogChildExit, postmaster.c:3749
2019-11-06 10:47:11.387 JST [19737] LOG:  00000: terminating any other active server processes, because a server process exited abnormally and possibly corrupted shared memory.
2019-11-06 10:47:11.387 JST [19737] LOCATION:  HandleChildCrash, postmaster.c:3469
2019-11-06 10:47:11.387 JST [19800] WARNING:  57P02: terminating connection due to command by postmaster
2019-11-06 10:47:11.387 JST [19800] LOCATION:  quickdie, postgres.c:2736
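
Roughly, the above would come from a change along these lines (only a
sketch to show the direction, not a concrete patch; the exact wording
and whether to keep the HINT are open):

    /* sketch, pmdie()'s immediate-shutdown branch in postmaster.c:
     * announce the fan-out once, instead of once per child process */
    ereport(LOG,
            (errmsg("terminating all active server processes")));

    /* sketch, quickdie() in postgres.c: drop the crash wording and the
     * DETAIL/HINT from the per-backend WARNING; the reason now appears
     * only in the postmaster's own log entry */
    ereport(WARNING,
            (errcode(ERRCODE_CRASH_SHUTDOWN),
             errmsg("terminating connection due to command by postmaster")));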

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:
> 
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOG:  00000: invalid record length at F/48C0A500: wanted 24, got 0
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION:  ReadRecord, xlog.c:4282
> or
> 2019-11-05 15:35:03.321 PST [28518][1/0] LOG:  invalid resource manager ID 52 at 3/7CD0B1B8
> 
> or any of the other types of xlogreader errors.
> 
> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.
> 
> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could/should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
> 
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

The proposed message seems far less scary. +1.
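
For illustration only (placement and wording are just a sketch, and
errRecPtr/errormsg stand in for whatever ReadRecord() has at hand),
the report could go in this direction:

    /* sketch: make "end of valid WAL" the primary message, demote the
     * raw xlogreader complaint to a DETAIL, and add the HINT for the
     * elevel < ERROR case */
    ereport(LOG,
            (errmsg("reached end of valid WAL at %X/%X",
                    (uint32) (errRecPtr >> 32), (uint32) errRecPtr),
             errdetail("End was detected by: %s.", errormsg),
             errhint("This is expected at the end of recovery and is not a failure.")));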

> 3) When a standby node is shutdown in immediate mode, we issue:
> 
> 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
> 
> 
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Recovery always starts from the latest REDO point. Maybe
MinRecoveryPoint is better suited for that use.
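
As a rough sketch of that idea (lastRecoveryStart is a hypothetical
new ControlFileData field; MinRecoveryPoint could play a similar role
as noted above), the check in StartupXLOG() might go like this:

    /* sketch only: emit the corruption HINT only when recovery is about
     * to start from the same point as the previous attempt, i.e. no
     * progress was made in between; lastRecoveryStart is hypothetical */
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        checkPoint.redo == ControlFile->lastRecoveryStart)
        ereport(LOG,
                (errmsg("database system was interrupted while in recovery"),
                 errhint("If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.")));

    ControlFile->lastRecoveryStart = checkPoint.redo;  /* remember for next time */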

> I'm sure there are more types of messages in this category, these are
> just the ones I could immediately recall from memory as having scared
> actual users unnecessarily.

I often see inquiries about "FATAL:  the database system is starting
up". It is indeed FATAL for the backend internally, but it is overly
scary for users.
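
If I remember correctly that message comes from the canAcceptConnections
check in ProcessStartupPacket() (postmaster.c), roughly:

    ereport(FATAL,
            (errcode(ERRCODE_CANNOT_CONNECT_NOW),
             errmsg("the database system is starting up")));

so softening it would only need a change there, for example adding a
HINT that the state is transient (wording is only a sketch):

    ereport(FATAL,
            (errcode(ERRCODE_CANNOT_CONNECT_NOW),
             errmsg("the database system is starting up"),
             errhint("Startup or recovery is still in progress; retry the connection in a moment.")));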

> I don't plan on fixing these immediately myself, even if we were to
> agree on something, so if anybody is interested in helping...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


