Should we make scary sounding, but actually routine, errors less scary? - Mailing list pgsql-hackers

From Andres Freund
Subject Should we make scary sounding, but actually routine, errors less scary?
Date
Msg-id 20191105235422.thf77mhjsaczaa23@alap3.anarazel.de
Responses Re: Should we make scary sounding, but actually routine, errors less scary?  (Chapman Flack <chap@anastigmatix.net>)
Re: Should we make scary sounding, but actually routine, errors less scary?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: Should we make scary sounding, but actually routine, errors less scary?  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
Hi,

There are a few errors that we issue that are, often, much less bad than
they sound. The most common cases I can immediately recall are:


1) Mentioning a crash, once for each backend, when shutting down in
immediate mode. Currently the log output for that, with just two sessions
connected, is the following:

2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [23199][4/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION:  quickdie, postgres.c:2734
2019-11-05 15:09:52.634 PST [23187][3/0] WARNING:  57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [23187][3/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION:  quickdie, postgres.c:2734
2019-11-05 15:09:52.634 PST [9345][1/0] WARNING:  57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [9345][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION:  quickdie, postgres.c:2734
2019-11-05 15:09:52.644 PST [9340][] LOG:  00000: database system is shut down
2019-11-05 15:09:52.644 PST [9340][] LOCATION:  UnlinkLockFiles, miscinit.c:859

(23199, 23187 are backends, 9345 is the autovacuum launcher)

I think there are multiple things wrong with this. For one, reading that
the server has crashed (no "might" in there) is scary, when all that's
happening is an intentional shutdown.  But the sheer log volume is also
bad - on a busy server there may be a *lot* of these lines.

I think the log volume is bad even for an actual PANIC. I've spent *way*
too much time scrolling through pages and pages of the above lines, just
to find the one or two lines indicating an actual error.


It seems like we ought to be able to somehow

a) Signal that the server has been shut down in immediate mode, rather
than having actually crashed, and issue a different log message to the
user (rough sketch below).

b) Stop issuing the same message over and over to the server log; instead
we ought to just send it to the client. We do however need to be careful
that we don't make it harder to debug a SIGQUIT sent directly to backend
processes.
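
For a), something very roughly along these lines might work. Note that
QuitReason / SharedQuitReason are entirely made-up names here; the real
mechanism for the postmaster to tell children *why* it sent SIGQUIT
(shared memory flag or otherwise) would need more thought:

/*
 * Very rough sketch only -- QuitReason and SharedQuitReason are invented
 * names, standing in for some way for the postmaster to record, before
 * signalling children, whether SIGQUIT is due to an immediate shutdown
 * request or to a genuine crash of a sibling process.
 */
typedef enum QuitReason
{
	QUIT_REASON_CRASH,				/* another server process died unexpectedly */
	QUIT_REASON_IMMEDIATE_SHUTDOWN	/* operator requested immediate shutdown */
} QuitReason;

extern volatile QuitReason *SharedQuitReason;	/* hypothetical, set by postmaster */

static void
quickdie(SIGNAL_ARGS)
{
	/* ... existing signal-handling preamble ... */

	if (*SharedQuitReason == QUIT_REASON_IMMEDIATE_SHUTDOWN)
		ereport(WARNING,
				(errcode(ERRCODE_ADMIN_SHUTDOWN),
				 errmsg("terminating connection due to immediate shutdown request"),
				 errhint("In a moment you should be able to reconnect to the database "
						 "and repeat your command.")));
	else
		ereport(WARNING,
				(errcode(ERRCODE_CRASH_SHUTDOWN),
				 errmsg("terminating connection because of crash of another server process"),
				 errdetail("The postmaster has commanded this server process to roll back the "
						   "current transaction and exit, because another server process exited "
						   "abnormally and possibly corrupted shared memory."),
				 errhint("In a moment you should be able to reconnect to the database "
						 "and repeat your command.")));

	/* ... existing exit path ... */
}

b) presumably needs elog.c to be taught to emit this message only to the
client and not to the server log, which I won't try to sketch here.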


2) At the end of crash recovery, and often when the startup process
switches between WAL sources, we get scary messages like:

2019-11-05 15:20:21.907 PST [23407][] <> <> LOG:  00000: invalid record length at F/48C0A500: wanted 24, got 0
2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION:  ReadRecord, xlog.c:4282
or
2019-11-05 15:35:03.321 PST [28518][1/0] LOG:  invalid resource manager ID 52 at 3/7CD0B1B8

or any of the other types of xlogreader errors.

I've seen countless tickets being raised because PG users looked at
their logs and got scared. We want them to look at the logs, so this
seems counter-productive.

It seems we, at the very least, could add an error context or something
explaining that a LOG message about the end of the WAL is to be expected. Or
perhaps we should reformulate the message to something like
'ELEVEL: reached end of valid WAL at XX/XX'
'DETAIL: end determined due to invalid record length: wanted 24, got 0'

perhaps with a HINT in the elevel < ERROR case indicating that this is
not a failure.
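
For illustration, the rewording could look roughly like this at the
ReadRecord() call site - variable names like expected_len/actual_len are
assumed here, not the actual ones from xlog.c:

/*
 * Sketch of the proposed rewording: make "reached end of valid WAL" the
 * primary message, demote the xlogreader error to a DETAIL, and add a
 * HINT when we're not erroring out.
 */
ereport(emode,
		(errmsg("reached end of valid WAL at %X/%X",
				(uint32) (RecPtr >> 32), (uint32) RecPtr),
		 errdetail("End of WAL determined due to invalid record length: wanted %u, got %u.",
				   expected_len, actual_len),
		 (emode < ERROR) ?
		 errhint("This is expected at the end of recovery and when switching WAL sources, "
				 "and does not indicate corruption.") : 0));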


3) When a standby node is shut down in immediate mode, we issue:

2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

Which imo doesn't really make sense, at least for standbys, which are
expected to be in recovery forever. There are plenty of scenarios
where one would shut down a standby in immediate mode for good reasons;
we shouldn't issue such warnings in that case.


The tricky part in improving this would be how to detect when the
message should still be issued for a standby. One idea, which is not
bulletproof but might be good enough, would be to record in the control
file which position recovery was started from last time, and only issue
the message when recovery would start from the same point.
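
As a rough sketch - lastRecoveryStartPoint is an invented pg_control
field, and a real patch would need a pg_control version bump and more
care about where exactly this runs in StartupXLOG():

/*
 * Sketch only: only complain about an interrupted recovery if we are
 * about to start from the same point the previous (apparently failed)
 * attempt started from.
 */
if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
	ControlFile->checkPoint == ControlFile->lastRecoveryStartPoint)
	ereport(LOG,
			(errmsg("database system was interrupted while in recovery at log time %s",
					str_time(ControlFile->checkPointCopy.time)),
			 errhint("If this has occurred more than once some data might be corrupted "
					 "and you might need to choose an earlier recovery target.")));

/* remember where this recovery attempt starts, for next time */
ControlFile->lastRecoveryStartPoint = ControlFile->checkPoint;
UpdateControlFile();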



I'm sure there are more types of messages in this category; these are
just the ones I could immediately recall from memory as having scared
actual users unnecessarily.


I don't plan on fixing these immediately myself, even if we were to
agree on something, so if anybody is interested in helping...

Greetings,

Andres Freund


