Thread: Should we make scary sounding, but actually routine, errors less scary?

Should we make scary sounding, but actually routine, errors less scary?

From: Andres Freund
Date:
Hi,

There's a few errors that we issue that are, often, much less bad than
they sound. The most common cases I immediately can recall are:


1) Mentioning crash, once for each backend, when shutting down
immediately. Currently the log output for that, with just two sessions
connected, is the following:

2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [23199][4/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION:  quickdie, postgres.c:2734
2019-11-05 15:09:52.634 PST [23187][3/0] WARNING:  57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [23187][3/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION:  quickdie, postgres.c:2734
2019-11-05 15:09:52.634 PST [9345][1/0] WARNING:  57P02: terminating connection because of crash of another server process
2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-11-05 15:09:52.634 PST [9345][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION:  quickdie, postgres.c:2734
2019-11-05 15:09:52.644 PST [9340][] LOG:  00000: database system is shut down
2019-11-05 15:09:52.644 PST [9340][] LOCATION:  UnlinkLockFiles, miscinit.c:859

(23199, 23187 are backends, 9345 is the autovacuum launcher)

I think there are multiple things wrong with this. For one, reading that
the server has crashed (no "might" in there) is scary, when all that's
happening is an intentional shutdown.  But the sheer log volume is also
bad - on a busy server there may be a *lot* of these lines.

I think the log volume is bad even for an actual PANIC. I've spent *way*
too much time scrolling through pages and pages of the above lines, just
to find the one or two lines indicating an actual error.


It seems like we ought to be able to somehow

a) Signal that the server has been shut down in immediate mode, rather
than actually crashed, and issue a different log message to the user.

b) Stop issuing, to the server log, the same message over and over. We
instead just ought to send the message to the client. We however need to
be careful that we don't make it harder to debug a SIGQUIT sent directly
to backend processes.
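
To illustrate the mechanics behind a), here is a minimal standalone sketch -
not PostgreSQL code, all names and messages are invented: a parent records in
shared memory why it is about to send SIGQUIT, and the child's signal handler
picks its message based on that flag instead of assuming a crash.

/* Sketch only: parent records the reason for SIGQUIT in shared memory
 * before signalling, so children can log an accurate, less scary message.
 * All names and messages here are invented for illustration. */
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum
{
    QUIT_REASON_NONE,
    QUIT_REASON_IMMEDIATE_SHUTDOWN,
    QUIT_REASON_CRASH
} QuitReason;

static volatile QuitReason *quit_reason;        /* lives in shared memory */

static void
quickdie_handler(int signo)
{
    const char *msg =
        (*quit_reason == QUIT_REASON_IMMEDIATE_SHUTDOWN)
        ? "WARNING:  terminating connection due to immediate shutdown request\n"
        : "WARNING:  terminating connection because of crash of another server process\n";

    (void) signo;
    write(STDERR_FILENO, msg, strlen(msg));
    _exit(2);
}

int
main(void)
{
    pid_t       child;

    /* Anonymous shared mapping, inherited by forked children. */
    quit_reason = mmap(NULL, sizeof(QuitReason), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *quit_reason = QUIT_REASON_NONE;

    child = fork();
    if (child == 0)
    {
        /* "backend": install the handler and wait to be told to quit */
        signal(SIGQUIT, quickdie_handler);
        pause();
        return 0;
    }

    sleep(1);                       /* crude: let the child install its handler */

    /* "postmaster" doing an intentional immediate shutdown: record the
     * reason first, then SIGQUIT the children, as proposal a) suggests. */
    *quit_reason = QUIT_REASON_IMMEDIATE_SHUTDOWN;
    kill(child, SIGQUIT);
    waitpid(child, NULL, 0);
    return 0;
}

The real thing would of course have to live in the postmaster/quickdie() path
and fit into the existing shutdown state machine; the sketch only shows the
signalling idea.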


2) At the end of crash recovery, and often when the startup process
switches between WAL sources, we get scary messages like:

2019-11-05 15:20:21.907 PST [23407][] <> <> LOG:  00000: invalid record length at F/48C0A500: wanted 24, got 0
2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION:  ReadRecord, xlog.c:4282
or
2019-11-05 15:35:03.321 PST [28518][1/0] LOG:  invalid resource manager ID 52 at 3/7CD0B1B8

or any of the other types of xlogreader errors.

I've seen countless tickets being raised because PG users looked at
their logs and got scared. We want them to look at the logs, so this
seems counter-productive.

It seems we, at the very least, could add an error context or something
explaining that a LOG message about the log end is to be expected. Or
perhaps we should reformulate the message to something like
'ELEVEL: reached end of valid WAL at XX/XX'
'DETAIL: end determined due to invalid record length: wanted 24, got 0'

perhaps with a HINT in the elevel < ERROR case indicating that this is
not a failure.
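
To make that concrete, the reformulated report could be produced with the
usual ereport() machinery - the following is only a sketch, not a worked-out
patch; EndRecPtr and errormsg stand in for whatever LSN and xlogreader message
the surrounding startup code has at hand:

/* Sketch only - wording and variable names are illustrative.  The idea is
 * to report the interpretation ("end of valid WAL") at the top level and
 * demote the xlogreader's low-level complaint to a DETAIL. */
ereport(LOG,
        (errmsg("reached end of valid WAL at %X/%X",
                (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr),
         errdetail("End of WAL was determined by: %s.", errormsg),
         errhint("This is expected when crash recovery finishes or when the startup process switches WAL sources; it is not a failure.")));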


3) When a standby node is shutdown in immediate mode, we issue:

2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

Which imo doesn't really make sense, at least for standbys, which are
expected to be in recovery forever. There are more than enough scenarios
where one would shut down a standby in immediate mode for good reasons;
we shouldn't issue such warnings in that case.


The tricky part in improving this would be how to detect when the
message should still be issued for a standby. One idea, which is not
bullet proof but might be good enough, would be to record in the control
file which position recovery was started from last time, and only issue
the error when recovery would start from the same point.
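
Roughly, and only as a sketch - lastRecoveryStartPoint is an invented
pg_control field, not something that exists today - the check in the startup
code could look like:

/* Sketch only - lastRecoveryStartPoint is an invented ControlFileData
 * field.  Warn only when recovery is about to start from the same spot it
 * started from last time, i.e. the previous attempt seemingly made no
 * progress; otherwise stay quiet for routine standby restarts. */
if (ControlFile->lastRecoveryStartPoint == checkPoint.redo)
    ereport(LOG,
            (errmsg("database system was interrupted while in recovery at log time %s",
                    str_time(ControlFile->checkPointCopy.time)),
             errhint("If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.")));

/* Remember this attempt's starting point for comparison next time. */
ControlFile->lastRecoveryStartPoint = checkPoint.redo;
UpdateControlFile();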



I'm sure there are more types of messages in this category, these are
just the ones I could immediately recall from memory as having scared
actual users unnecessarily.


I don't plan on fixing these immediately myself, even if we were to
agree on something, so if anybody is interested in helping...

Greetings,

Andres Freund



Re: Should we make scary sounding, but actually routine, errors less scary?

From: Chapman Flack
Date:
On 11/05/19 18:54, Andres Freund wrote:
> Hi,
> 
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:
> 
> 
> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:

While on the topic ... this may be more a property of particular
packagings of the server, to run under systemd etc., but often there
is a process during startup trying periodically to open a connection
to the server to confirm that it has successfully started, and the
result is a dozen or so log messages that say "FATAL: the server is
starting" ... which is amusing once you get what it's doing, but a bit
disconcerting until then.

Not sure how that could be changed ... maybe a connection-time option
trial_connection that would suppress the fatal ereport on rejecting
the connection?

Regards,
-Chap



Re: Should we make scary sounding, but actually routine, errors less scary?

From: Kyotaro Horiguchi
Date:
At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:
> 
> 
> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:
> 
> 2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
> 2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
> 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23199][4/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION:  quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING:  57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23187][3/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION:  quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING:  57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [9345][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION:  quickdie, postgres.c:2734
> 2019-11-05 15:09:52.644 PST [9340][] LOG:  00000: database system is shut down
> 2019-11-05 15:09:52.644 PST [9340][] LOCATION:  UnlinkLockFiles, miscinit.c:859
> 
> (23199, 23187 are backends, 9345 is the autovacuum launcher)
> 
> I think there's multiple things wrong with this. For one, reading that
> the server has (no might in there) crashed, is scary, when all that's
> happening is an intentional shutdown.  But also the sheer log volume is
> bad - on a busy server there may be a *lot* of these lines.
> 
> I think the log volume is bad even for an actual PANIC. I've spent *way*
> too much time scrolling through pages and pages of the above lines, just
> to find the one or two lines indicating an actual error.
>
> It seems like we ought to be able to somehow
> 
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.
> 
> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

I doubt that having different messages for the server log and the client
is worth doing. Isn't it enough to move the cause description from the
backend message to the postmaster one?

In addition to that, I don't see such a message when connecting with
psql. It just reports "server closed the connection unexpectedly" without
a server message. If I'm not missing something, the HINT message is
useless.


  2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
  2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
+ 2019-11-05 15:09:52.634 PST [9340][] LOG:  terminating all active server processes
- 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
+ 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection due to command by postmaster
- 2019-11-05 15:09:52.634 PST [23199][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
- 2019-11-05 15:09:52.634 PST [23199][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

If some process has crashed, it looks like:

2019-11-06 10:47:11.387 JST [19737] LOG:  00000: server process (PID 19774) was terminated by signal 11: Segmentation fault
2019-11-06 10:47:11.387 JST [19737] DETAIL:  Failed process was running: select pg_sleep(1200);
2019-11-06 10:47:11.387 JST [19737] LOCATION:  LogChildExit, postmaster.c:3749
2019-11-06 10:47:11.387 JST [19737] LOG:  00000: terminating any other active server processes, because a server process exited abnormally and possibly corrupted shared memory.
2019-11-06 10:47:11.387 JST [19737] LOCATION:  HandleChildCrash, postmaster.c:3469
2019-11-06 10:47:11.387 JST [19800] WARNING:  57P02: terminating connection due to command by postmaster
2019-11-06 10:47:11.387 JST [19800] LOCATION:  quickdie, postgres.c:2736

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:
> 
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOG:  00000: invalid record length at F/48C0A500: wanted 24, got 0
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION:  ReadRecord, xlog.c:4282
> or
> 2019-11-05 15:35:03.321 PST [28518][1/0] LOG:  invalid resource manager ID 52 at 3/7CD0B1B8
> 
> or any of the other types of xlogreader errors.
> 
> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.
> 
> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
> 
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

The proposed message seems far less scary. +1.

> 3) When a standby node is shutdown in immediate mode, we issue:
> 
> 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
> 
> 
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Recovery always starts from the latest REDO point. Maybe
MinRecoveryPoint is better for the use.

> I'm sure there are more types of messages in this category, these are
> just the ones I could immediately recall from memory as having scared
> actual users unnecessarily.

I often see inquiries on "FATAL: the database system is starting
up". It is actually FATAL for backends internally but it is also
overly scary for users.

> I don't plan on fixing these immediately myself, even if we were to
> agree on something, so if anybody is interested in helping...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Should we make scary sounding, but actually routine, errors less scary?

From: Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:

Agreed.

> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:

Eh, +1 or +2.

> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.

That'd be nice, but I anticipate the question coming up of "how much is
too much to do before an immediate shutdown?"

> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

Would be nice if we could improve that, agreed.

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:

+1000

> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.

Yes, agreed, this happens all the time and would be good to improve.

> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

Something like that does look like an improvement.

> 3) When a standby node is shutdown in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
>
>
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Yeah... This sounds like it would be more difficult to tackle, though I
agree it'd be nice to improve on this too.

Thanks,

Stephen


Re: Should we make scary sounding, but actually routine, errors less scary?

From: Andres Freund
Date:
Hi,

On 2019-11-05 22:00:58 -0500, Chapman Flack wrote:
> On 11/05/19 18:54, Andres Freund wrote:
> > Hi,
> > 
> > There's a few errors that we issue that are, often, much less bad than
> > they sound. The most common cases I immediately can recall are:
> > 
> > 
> > 1) Mentioning crash, once for each backend, when shutting down
> > immediately. Currently the log output for that, with just two sessions
> > connected, is the following:
> 
> While on the topic ... this may be more a property of particular
> packagings of the server, to run under systemd etc., but often there
> is a process during startup trying periodically to open a connection
> to the server to confirm that it has successfully started, and the
> result is a dozen or so log messages that say "FATAL: the server is
> starting" ... which is amusing once you get what it's doing, but a bit
> disconcerting until then.

I think that is best solved by using pg_ctl's logic to look at the
postmaster state file, rather than connecting before the server is
ready. For one, connecting requires actually being able to connect, which
isn't always a given.  If using pg_ctl is problematic for some reason,
it'd imo be better to extract the relevant logic into its own tool.
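
For the separate-tool route, the check itself is small. The following
standalone sketch is not pg_ctl's actual code; it assumes the postmaster.pid
layout of recent releases, where the last line carries a status word such as
"starting" or "ready" - verify that against the target server version before
relying on it:

/* Sketch only: determine server readiness the way pg_ctl does - by reading
 * the status word from postmaster.pid - instead of issuing trial
 * connections that produce FATAL lines in the server log. */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the postmaster reports "ready", 0 if it is up but not yet
 * accepting connections, -1 if postmaster.pid cannot be read. */
static int
postmaster_is_ready(const char *datadir)
{
    char        path[1024];
    char        line[256];
    char        last[256] = "";
    FILE       *fp;

    snprintf(path, sizeof(path), "%s/postmaster.pid", datadir);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL)
        strncpy(last, line, sizeof(last) - 1);
    fclose(fp);

    /* status words are blank-padded, so compare only the prefix */
    return strncmp(last, "ready", 5) == 0 ? 1 : 0;
}

int
main(void)
{
    /* example data directory path, adjust as needed */
    int         state = postmaster_is_ready("/var/lib/postgresql/data");

    if (state == 1)
        puts("postmaster reports ready");
    else if (state == 0)
        puts("postmaster is up but not accepting connections yet");
    else
        puts("no readable postmaster.pid (is the server running?)");
    return 0;
}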


> Not sure how that could be changed ... maybe a connection-time option
> trial_connection that would suppress the fatal ereport on rejecting
> the connection?

I think that'd be a recipe for hard-to-debug issues. Imagine somebody
DoSing the server and setting that option - you'd have no way to
actually see what's happening.

Greetings,

Andres Freund



Re: Should we make scary sounding, but actually routine, errors less scary?

From: Andres Freund
Date:
Hi,

On 2019-11-06 17:36:09 +0900, Kyotaro Horiguchi wrote:
> At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > There's a few errors that we issue that are, often, much less bad than
> > they sound. The most common cases I immediately can recall are:
> > 
> > 
> > 1) Mentioning crash, once for each backend, when shutting down
> > immediately. Currently the log output for that, with just two sessions
> > connected, is the following:
> > 
> > 2019-11-05 15:09:52.634 PST [9340][] LOG:  00000: received immediate shutdown request
> > 2019-11-05 15:09:52.634 PST [9340][] LOCATION:  pmdie, postmaster.c:2883
> > 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING:  57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [23199][4/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION:  quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING:  57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [23187][3/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION:  quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING:  57P02: terminating connection because of crash of another server process
> > 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > 2019-11-05 15:09:52.634 PST [9345][1/0] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION:  quickdie, postgres.c:2734
> > 2019-11-05 15:09:52.644 PST [9340][] LOG:  00000: database system is shut down
> > 2019-11-05 15:09:52.644 PST [9340][] LOCATION:  UnlinkLockFiles, miscinit.c:859
> > 
> > (23199, 23187 are backends, 9345 is the autovacuum launcher)
> > 
> > I think there's multiple things wrong with this. For one, reading that
> > the server has (no might in there) crashed, is scary, when all that's
> > happening is an intentional shutdown.  But also the sheer log volume is
> > bad - on a busy server there may be a *lot* of these lines.
> > 
> > I think the log volume is bad even for an actual PANIC. I've spent *way*
> > too much time scrolling through pages and pages of the above lines, just
> > to find the one or two lines indicating an actual error.
> >
> > It seems like we ought to be able to somehow
> > 
> > a) Signal that the server has been shut down in immediate mode, rather
> > than actually crashed, and issue a different log message to the user.
> > 
> > b) Stop issuing, to the server log, the same message over and over. We
> > instead just ought to send the message to the client. We however need to
> > be careful that we don't make it harder to debug a SIGQUIT sent directly
> > to backend processes.
> 
> I doubt that different messages for server log and client worth
> doing. Isn't it enough moving the cause description from backend
> message to postmaster one?

I'm not quite following what you're suggesting?


> Addition to that, I don't see such a message on connecting psql. It
> just reports as "server closed the connection unexpectedly" without a
> server message. If I'm not missing something the HINT message is
> useless..

Whether you'll get the error message or not depends on the state of the
connection, IIRC. There are some cases where we might get notified
about the connection having been closed without actually reading the
error message.  There were some libpq fixes around this not too long
ago - is it possible that you're running a version of psql linked against
an older libpq version (e.g. from the OS)?


I'm not a fan of the above error message libpq generates - it logs this
in plenty of cases where there was just a network error. Talking about
server crashes when simple network issues may be the reason is, imo, not
a great idea.




> > 3) When a standby node is shutdown in immediate mode, we issue:
> > 
> > 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> > 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> > 
> > Which imo doesn't really make sense for at least standbys, which are
> > expected to be in recovery forever. There's more than enough scenarios
> > where one would shut down a standby in immediate mode for good reasons,
> > we shouldn't issue such warnings in that case.
> > 
> > 
> > The tricky part in improving this would be how to detect when the
> > message should still be issued for a standby. One idea, which is not
> > bullet proof but might be good enough, would be to record in the control
> > file which position recovery was started from last time, and only issue
> > the error when recovery would start from the same point.
> 
> Recovery always starts from the latest REDO point.

Well, not quite always, e.g. when restoring from a base backup. But
otherwise, yea. But I'm not quite sure what your point is? We store the
last checkpoint/redo position in the control file, and update it every
checkpoint/restartpoint - so the last REDO pointer stored there, would
not necessarily be the same as the redo pointer we started up from last?

Am I misunderstanding?


> Maybe MinRecoveryPoint is better for the use.

What MinRecoveryPoint gets set to is not deterministic - it depends,
e.g., on which buffers get written out at what time. I don't quite see how
we could make reliable use of it.


> > I'm sure there are more types of messages in this category, these are
> > just the ones I could immediately recall from memory as having scared
> > actual users unnecessarily.
> 
> I often see inquiries on "FATAL: the database system is starting
> up". It is actually FATAL for backends internally but it is also
> overly scary for users.

We could probably just reformulate that error message and help users,
without any larger changes...
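
For example - only a sketch, and the hint wording is invented, though the
error code and message are the ones currently used when connections are
rejected during startup:

/* Sketch only: keep the existing message and error code, but add a hint
 * so the rejection reads as routine rather than as a failure.  The hint
 * text here is invented, not an agreed-on wording. */
ereport(FATAL,
        (errcode(ERRCODE_CANNOT_CONNECT_NOW),
         errmsg("the database system is starting up"),
         errhint("This is expected while the server is still performing startup or recovery; retry shortly.")));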

Greetings,

Andres Freund