Greetings,
* Andres Freund (andres@anarazel.de) wrote:
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:
Agreed.
> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:
Eh, +1 or +2.
> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.
That'd be nice, but I anticipate the question coming up of "how much is
too much to do before an immediate shutdown?"
> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.
Would be nice if we could improve that, agreed.
> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:
+1000
> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.
Yes, agreed, this happens all the time and would be good to improve.
> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.
Something like that does look like an improvement.
> 3) When a standby node is shutdown in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05
15:37:43PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you
mightneed to choose an earlier recovery target.
>
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
>
>
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.
Yeah... This sounds like it would be more difficult to tackle, though I
agree it'd be nice to improve on this too.
Thanks,
Stephen