Re: Should we make scary sounding, but actually routine, errors lessscary? - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Should we make scary sounding, but actually routine, errors lessscary?
Date
Msg-id 20191106132159.GS6962@tamriel.snowman.net
Whole thread Raw
In response to Should we make scary sounding, but actually routine, errors lessscary?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:

Agreed.

> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:

Eh, +1 or +2.

> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.

That'd be nice, but I anticipate the question coming up of "how much is
too much to do before an immediate shutdown?"

> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

Would be nice if we could improve that, agreed.

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:

+1000

> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.

Yes, agreed, this happens all the time and would be good to improve.

> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

Something like that does look like an improvement.

> 3) When a standby node is shutdown in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted while in recovery at log time 2019-11-05
15:37:43PST 
> 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than once some data might be corrupted and you
mightneed to choose an earlier recovery target. 
>
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
>
>
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Yeah... This sounds like it would be more difficult to tackle, though I
agree it'd be nice to improve on this too.

Thanks,

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Collation versioning
Next
From: "曾文旌(义从)"
Date:
Subject: Re: [Proposal] Global temporary tables