Re: Should we make scary sounding, but actually routine, errors less scary? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Should we make scary sounding, but actually routine, errors less scary?
Date
Msg-id 20191108000011.psz2zbiry5jomvu2@alap3.anarazel.de
In response to Re: Should we make scary sounding, but actually routine, errors less scary?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Hi,

On 2019-11-06 17:36:09 +0900, Kyotaro Horiguchi wrote:
> At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > There's a few errors that we issue that are, often, much less bad than
> > they sound. The most common cases I immediately can recall are:
> > 
> > 
> > 1) Mentioning crash, once for each backend, when shutting down
> > immediately. Currently the log output for that, with just two sessions
> > connected, is the following:
> > 
> > 2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
> > 2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
> > 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [23199][4/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION:  quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING:  57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [23187][3/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION:  quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING:  57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [9345][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION:  quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.644 PST [9340][] LOG:  00000: database system is shut down
> > 2019-11-05 15:09:52.644 PST [9340][] LOCATION:  UnlinkLockFiles, miscinit.c:859
> > 
> > (23199, 23187 are backends, 9345 is the autovacuum launcher)
> > 
> > I think there are multiple things wrong with this. For one, reading that
> > the server has crashed (no "might" in there) is scary, when all that's
> > actually happening is an intentional shutdown.  But the sheer log volume is
> > also bad - on a busy server there may be a *lot* of these lines.
> > 
> > I think the log volume is bad even for an actual PANIC. I've spent *way*
> > too much time scrolling through pages and pages of the above lines, just
> > to find the one or two lines indicating an actual error.
> >
> > It seems like we ought to be able to somehow
> > 
> > a) Signal that the server has been shut down in immediate mode, rather
> > than actually crashed, and issue a different log message to the user.
> > 
> > b) Stop issuing, to the server log, the same message over and over. We
> > instead just ought to send the message to the client. We however need to
> > be careful that we don't make it harder to debug a SIGQUIT sent directly
> > to backend processes.
> 
> I doubt that different messages for the server log and the client are
> worth doing. Isn't it enough to move the cause description from the
> backend message to the postmaster one?

I'm not quite following what you're suggesting?


> In addition to that, I don't see such a message when connecting with
> psql. It just reports "server closed the connection unexpectedly"
> without a server message. If I'm not missing something, the HINT
> message is useless...

IIRC, whether you'll get the error message or not depends on the state
of the connection. There are some cases where we might get notified
about the connection having been closed without actually reading the
error message.  There were some libpq fixes around this not too long
ago; is it possible that you're running a version of psql linked against
an older libpq version (e.g. from the OS)?


I'm not a fan of the above error message libpq generates - it is
emitted in plenty of cases where there was just a network error.
Talking about server crashes when simple network issues may be the
cause is, imo, not a great idea.
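
For reference, here's a minimal standalone libpq program showing how that
surfaces on the client side (connection string and query are just
placeholders) - the same "server closed the connection unexpectedly" text
comes back whether the backend actually crashed, the server was shut down
immediately, or the TCP connection simply went away:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* placeholder connection string */
	PGconn	   *conn = PQconnectdb("host=localhost dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/*
	 * If the server goes away while this is running - crash, immediate
	 * shutdown, or a plain network problem - the result status is
	 * PGRES_FATAL_ERROR and PQerrorMessage() returns the scary text
	 * discussed above.
	 */
	PGresult   *res = PQexec(conn, "SELECT pg_sleep(60)");

	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "%s", PQerrorMessage(conn));

	PQclear(res);
	PQfinish(conn);
	return 0;
}

(build with something like: cc test.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq)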




> > 3) When a standby node is shutdown in immediate mode, we issue:
> > 
> > 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> > 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> > 
> > Which imo doesn't really make sense, at least for standbys, which are
> > expected to be in recovery forever. There are more than enough scenarios
> > where one would shut down a standby in immediate mode for good reasons;
> > we shouldn't issue such warnings in that case.
> > 
> > 
> > The tricky part in improving this would be how to detect when the
> > message should still be issued for a standby. One idea, which is not
> > bullet proof but might be good enough, would be to record in the control
> > file which position recovery was started from last time, and only issue
> > the error when recovery would start from the same point.
> 
> Recovery always starts from the latest REDO point.

Well, not quite always, e.g. when restoring from a base backup. But
otherwise, yea. I'm not quite sure what your point is, though? We store
the last checkpoint/redo position in the control file and update it at
every checkpoint/restartpoint - so the last REDO pointer stored there
would not necessarily be the same as the redo pointer we last started
up from?

Am I misunderstanding?
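
To spell out the control file idea a bit more concretely, here's a rough
sketch - lastRecoveryStartPoint is a made-up field name, nothing like it
exists in ControlFileData today, and the surrounding StartupXLOG() code
is elided:

/* hypothetical new field in ControlFileData */
XLogRecPtr	lastRecoveryStartPoint; /* REDO point recovery last started from */

/*
 * In StartupXLOG(), roughly where the "interrupted while in recovery"
 * message is emitted today: only warn if we're starting from the same
 * point as last time, i.e. the previous recovery attempt apparently
 * didn't get anywhere before being interrupted again.
 */
if (ControlFile->lastRecoveryStartPoint == checkPoint.redo)
	ereport(LOG,
			(errmsg("database system was interrupted while in recovery at log time %s",
					str_time(ControlFile->checkPointCopy.time)),
			 errhint("If this has occurred more than once some data might be corrupted and you "
					 "might need to choose an earlier recovery target.")));

/* remember where recovery started this time, for the next startup */
ControlFile->lastRecoveryStartPoint = checkPoint.redo;
UpdateControlFile();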


> Maybe MinRecoveryPoint is better for the use.

What MinRecoveryPoint gets set to is not deterministic - it depends,
e.g., on which buffers get written out at what time. I don't quite see
how we could make reliable use of it.


> > I'm sure there are more types of messages in this category, these are
> > just the ones I could immediately recall from memory as having scared
> > actual users unnecessarily.
> 
> I often see inquiries about "FATAL: the database system is starting
> up". It is actually FATAL for the backend internally, but it is also
> overly scary for users.

We could probably just reformulate that error message and help users,
without any larger changes...
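
Just to illustrate the direction, a sketch of what the ereport() that
produces it could look like - the errhint() wording is purely an example,
not a concrete proposal:

ereport(FATAL,
		(errcode(ERRCODE_CANNOT_CONNECT_NOW),
		 errmsg("the database system is starting up"),
		 errhint("The server is not ready to accept connections yet; this is expected during startup, retry in a moment.")));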

Greetings,

Andres Freund


