Thread: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Hello,

My customers and colleagues sometimes (or often?) ask about the following message:

    FATAL:  the database system is starting up

This message is often output dozens of times during a failover or PITR. The users are worried because the message level is FATAL and they don't know why such a severe message is output during a successful failover and recovery. I can't blame the users, because the message is merely a by-product of pg_ctl's internal ping.

Similarly, the message below is output when I stop the standby server normally. Why FATAL as the result of a successful operation? I'm afraid DBAs are annoyed by these messages, as system administration software collects ERROR and more severe messages for daily monitoring.

    FATAL:  terminating walreceiver process due to administrator command

Shouldn't we lower the severity of these messages, or avoid sending them to the server log? How about the following measures?

1. FATAL: the database system is starting up
2. FATAL: the database system is shutting down
3. FATAL: the database system is in recovery mode
4. FATAL: sorry, too many clients already

Report these as FATAL to the client because the client wants to know the reason, but don't output them to the server log because they are not necessary for DBAs (4 is subtle).

5. FATAL: terminating walreceiver process due to administrator command
6. FATAL: terminating background worker "%s" due to administrator command

Don't output these to the server log. Why are they necessary? For troubleshooting purposes? If necessary, the severity should be LOG (but then I wonder why other background processes are not reported...).

To suppress server log output, I think we can do as follows. I guess ereport(FATAL) is still needed for easily handling both the client report and process termination:

    log_min_messages = PANIC;
    ereport(FATAL,
            (errcode(ERRCODE_CANNOT_CONNECT_NOW),
             errmsg("the database system is starting up")));

May I hear your opinions?

Regards
MauMau
On 12/5/13, 10:25 AM, MauMau wrote:
> Report these as FATAL to the client because the client wants to know the
> reason. But don't output them to server log because they are not
> necessary for DBAs

Yeah, this is part of a more general problem, which you have characterized correctly: what is fatal (or error, or warning, ...) to the client isn't necessarily fatal (or error, or warning, ...) to the server or DBA.

Fixing this would need a larger enhancement of the logging infrastructure. It's been discussed before, but it's a bit of work.
"MauMau" <maumau307@gmail.com> writes:
> Shouldn't we lower the severity or avoid sending those messages to the server log?

No. They are FATAL so far as the individual session is concerned. Possibly some documentation effort is needed here, but I don't think any change in the code behavior would be an improvement.

> 1. FATAL: the database system is starting up
> 2. FATAL: the database system is shutting down
> 3. FATAL: the database system is in recovery mode
> 4. FATAL: sorry, too many clients already
> Report these as FATAL to the client because the client wants to know the
> reason. But don't output them to server log because they are not necessary
> for DBAs (4 is subtle.)

The notion that a DBA should not be allowed to find out how often #4 is happening is insane.

			regards, tom lane
Tom Lane-2 wrote
> "MauMau" <maumau307@gmail.com> writes:
>> Shouldn't we lower the severity or avoid sending those messages to the server log?
>
> No. They are FATAL so far as the individual session is concerned.
> Possibly some documentation effort is needed here, but I don't think
> any change in the code behavior would be an improvement.
>
> The notion that a DBA should not be allowed to find out how often #4 is
> happening is insane.

Agreed, #4 is definitely DBA territory.

ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate.

    FATAL: [LOGON] too many clients already

I'd make the category output disabled by default for a long while, then eventually enabled by default, but leave the ability to disable it. Calls that do not supply a category get [N/A] output in category mode.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/RFC-Shouldn-t-we-remove-annoying-FATAL-messages-from-server-log-tp5781899p5781925.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
* David Johnston (polobo@yahoo.com) wrote:
> ISTM that instituting some level of categorization for messages would be
> helpful. Then logging and reporting frameworks would be able to identify
> and segregate the logs in whatever way they and the configuration deems
> appropriate.

I've wanted to do that and have even discussed it with folks in the past; the trick is finding enough round tuits, which is difficult when you start to look at the size of the task...

	Thanks,

		Stephen
On 12/05/2013 10:21 AM, Stephen Frost wrote:
> * David Johnston (polobo@yahoo.com) wrote:
>> ISTM that instituting some level of categorization for messages would be
>> helpful. Then logging and reporting frameworks would be able to identify
>> and segregate the logs in whatever way they and the configuration deems
>> appropriate.
>
> I've wanted to do that and have even discussed it with folks in the
> past, the trick is finding enough round tuits, which is difficult when you
> start to look at the size of the task...

But ... if we set a firm policy on this, then we could gradually clean up the error messages piecemeal over the next couple of major versions. We could also make sure that any new features complied with the categorization policy.

Right now, how to categorize errors is up to each individual patch author, which means that things are all over the place, and get worse with each new feature added.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> But ... if we set a firm policy on this, then we could gradually clean
> up the error messages piecemeal over the next couple of major versions.
> We could also make sure that any new features complied with the
> categorization policy.
> Right now, how to categorize errors is up to each individual patch
> author, which means that things are all over the place, and get worse
> with each new feature added.

I don't think there's that much randomness in is-it-an-ERROR-or-not. What I believe Stephen is talking about is a classification that simply doesn't exist today, namely something around how likely the message is to be of interest to a DBA as opposed to the client application. We currently compose messages almost entirely with the client in mind, and that's as it should be. But we could use some new decoration that's more DBA-oriented to help decide what goes into the postmaster log.

Before we could get very far we'd need a better understanding than we have of which cases a DBA might be interested in. To take the specific example that started this thread, there wouldn't be a lot of value IMO in a classification like "connection failure messages". I think the OP is probably right that those are often uninteresting --- but as I mentioned, "too many clients" might become interesting if he's wondering whether he needs to enlarge max_connections. Or password failure cases might become interesting if he starts to suspect break-in attempts. So I'd want to see a design that credibly covers those sorts of needs before we put any large effort into code changes.

			regards, tom lane
On 12/05/2013 10:46 AM, Tom Lane wrote:
> Before we could get very far we'd need a better understanding than we have
> of which cases a DBA might be interested in. [...] So I'd want to see
> a design that credibly covers those sorts of needs before we put any large
> effort into code changes.

Heck, I'd be happy just to have a class of messages which specifically means "OMG, there's something wrong with the server", that is, a flag for messages which only occur when PostgreSQL encounters a bug, data corruption, or platform error. Right now, I have to suss those out by regex.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: Alvaro Herrera
Josh Berkus wrote:
> Heck, I'd be happy just to have a class of messages which specifically
> means "OMG, there's something wrong with the server", that is, a flag
> for messages which only occur when PostgreSQL encounters a bug, data
> corruption, or platform error. Right now, I have to suss those out by
> regex.

My proposal was to have something separate from message severity ("criticality"). So the problems would continue to be reported as FATAL, ERROR or WARNING, but if they are just the result of something the user did wrong, then they get marked as "non-critical"; but if there is, say, a failure to flush xlog (which currently results in an ERROR), we could flag it as critical. Grepping the log for critical messages, regardless of severity, would yield actual action items for the DBA, without having to filter things out by regex. Otherwise you have to come up with a lot of message classes, each keeping the current behavior of aborting the current transaction or not, or terminating the process or not.

There was also the idea that this would be driven off SQLSTATE, but this seems pretty unwieldy to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: "Peter Eisentraut" <peter_e@gmx.net>
> Yeah, this is part of a more general problem, which you have
> characterized correctly: What is fatal (or error, or warning, ...) to
> the client isn't necessarily fatal (or error, or warning, ...) to the
> server or DBA.

Thanks. In addition, #5 and #6 in my previous mail are even unnecessary for both the client and the DBA, aren't they?

> Fixing this would need a larger enhancement of the
> logging infrastructure. It's been discussed before, but it's a bit of
> work.

How about the easy fix I proposed? The current logging infrastructure seems enough to solve the original problem with small effort and without complicating the code. If you don't like "log_min_messages = PANIC", SetConfigOption() can be used instead. I think we'd better take a step to eliminate the problem at hand, as well as consider a much richer infrastructure in the long run. I'm also interested in the latter, and want to discuss it after solving the problem in front of me.

Regards
MauMau
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> No. They are FATAL so far as the individual session is concerned.
> Possibly some documentation effort is needed here, but I don't think
> any change in the code behavior would be an improvement.

You are suggesting that we should add a note like "Don't worry about the following message; it is the result of normal connectivity checking", aren't you?

    FATAL:  the database system is starting up

But I doubt most users would notice such notes. Anyway, lots of such messages certainly make monitoring and troubleshooting harder, because valuable messages get buried.

>> 4. FATAL: sorry, too many clients already
>> Report these as FATAL to the client because the client wants to know the
>> reason. But don't output them to server log because they are not
>> necessary for DBAs (4 is subtle.)
>
> The notion that a DBA should not be allowed to find out how often #4 is
> happening is insane.

I thought someone would point that out. You are right: #4 is a strong hint for the DBA about the max_connections setting or connection pool configuration.

Regards
MauMau
From: "Josh Berkus" <josh@agliodbs.com>
> Heck, I'd be happy just to have a class of messages which specifically
> means "OMG, there's something wrong with the server", that is, a flag
> for messages which only occur when PostgreSQL encounters a bug, data
> corruption, or platform error. Right now, I have to suss those out by
> regex.

What are the issues with using SQLSTATE XXnnn as a filter?

Regards
MauMau
On 2013-12-06 22:35:21 +0900, MauMau wrote:
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> No. They are FATAL so far as the individual session is concerned.
>> Possibly some documentation effort is needed here, but I don't think
>> any change in the code behavior would be an improvement.
>
> You are suggesting that we should add a note like "Don't worry about the
> following message. This is a result of normal connectivity checking", don't
> you?
>
> FATAL: the database system is starting up

Uh. An explanation of why you cannot connect to the database hardly seems like a superfluous log message.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: "Alvaro Herrera" <alvherre@2ndquadrant.com>
> There was also the idea that this would be driven off SQLSTATE but this
> seems pretty unwieldy to me.

You are referring to this long discussion, aren't you?

http://www.postgresql.org/message-id/19791.1335902957@sss.pgh.pa.us

I've read it before, and liked the SQLSTATE-based approach. It seems that properly assigned SQLSTATEs could be used as message IDs, and pairs of a SQLSTATE and its user action might be utilized to provide a sophisticated database administration GUI.

That discussion sounds interesting, and I want to take more time to consider it. But what do you think of my original suggestion to easily solve the current issue? I'd like to remove the current annoying problem first before spending much time on more exciting infrastructure.

Regards
MauMau
"MauMau" <maumau307@gmail.com> writes:
> That discussion sounds interesting, and I want to take more time to
> consider it. But what do you think of my original suggestion to easily solve
> the current issue? I'd like to remove the current annoying problem first
> before spending much time on more exciting infrastructure.

There is no enthusiasm for a quick-hack solution here, and most people don't actually agree with your proposal that these errors should never get logged. So no, that is not happening. You can hack your local copy that way if you like, of course, but it's not getting committed.

			regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> There is no enthusiasm for a quick-hack solution here, and most people
> don't actually agree with your proposal that these errors should never
> get logged. So no, that is not happening. You can hack your local
> copy that way if you like of course, but it's not getting committed.

Oh, I may have misunderstood your previous comments. I got the impression that you and others regard those messages (except "too many clients") as unnecessary in the server log.

1. FATAL: the database system is starting up
2. FATAL: the database system is shutting down
3. FATAL: the database system is in recovery mode

5. FATAL: terminating walreceiver process due to administrator command
6. FATAL: terminating background worker "%s" due to administrator command

Could you tell me why these are necessary in the server log? I guess like this. Am I correct?

* #1 through #3 are necessary for the DBA to investigate and explain to the end user why he cannot connect to the database.

* #5 and #6 are unnecessary for the DBA. I can't find any reason why these are useful for the DBA.

Regards
MauMau
MauMau wrote
> Oh, I may have misunderstood your previous comments. I got the impression
> that you and others regard those messages (except "too many clients") as
> unnecessary in server log.
>
> 1. FATAL: the database system is starting up
> 2. FATAL: the database system is shutting down
> 3. FATAL: the database system is in recovery mode
>
> 5. FATAL: terminating walreceiver process due to administrator command
> 6. FATAL: terminating background worker "%s" due to administrator command
>
> Could you tell me why these are necessary in server log? I guess like this.
> Am I correct?
>
> * #1 through #3 are necessary for the DBA to investigate and explain to the
> end user why he cannot connect to the database.
>
> * #5 and #6 are unnecessary for the DBA. I can't find any reason why
> these are useful for the DBA.

For me, 1-3 are unusual events in production situations, and so knowing when they occur, and confirming they occurred for a good reason, is a key job of the DBA.

5 and 6: I don't fully understand when they would happen, but they likely fall into the same category: "the DBA should know what is going on with their server and confirm any startup/shutdown activity it is involved with".

They might be better categorized at the "NOTICE" level if they were in response to an administrator action, versus in response to a crashed process, but even for the user-initiated situation, making sure they hit the log, even using FATAL, is totally understandable and IMO desirable.

I'd ask in what situations these messages are occurring so frequently that they are becoming noise instead of useful data? Sorry if I missed your use-case explanation up-thread.

David J.
David Johnston wrote
> I'd ask in what situations are these messages occurring so frequently that
> they are becoming noise instead of useful data? Sorry if I missed your
> use-case explanation up-thread.

Went and scanned the thread:

PITR/failover is not generally that frequent an occurrence, but I will agree that these events are likely common during such operations.

Maybe PITR/failover mode can output something in the logs to alleviate user angst about these frequent events? I'm doubting that a totally separate mechanism can be used for this "mode", but instead of looking for things to remove, how about adding some additional coddling to the logs at the beginning and end of the mode change?

Thought-provoking only, as I have not actually been a user of said feature.

David J.
MauMau wrote
> Oh, I may have misunderstood your previous comments. I got the impression
> that you and others regard those messages (except "too many clients") as
> unnecessary in server log.
>
> 1. FATAL: the database system is starting up

How about altering the message to tone down the severity by a half-step...

    FATAL: (probably) not! - the database system is starting up

David J.
From: "David Johnston" <polobo@yahoo.com>
>> 5. FATAL: terminating walreceiver process due to administrator command
>> 6. FATAL: terminating background worker "%s" due to administrator command
>
> 5 and 6: I don't fully understand when they would happen but likely fall
> into the same "the DBA should know what is going on with their server and
> confirm any startup/shutdown activity it is involved with".
>
> They might be better categorized "NOTICE" level if they were in response to
> a administrator action, versus in response to a crashed process, but even
> for the user-initiated situation making sure they hit the log but using
> FATAL is totally understandable and IMO desirable.

#5 is output when the DBA shuts down a replication standby server. #6 is output when the DBA shuts down the server while using any custom background worker. These messages are always output. What I'm seeing as a problem is that FATAL messages are output in a normal situation, which worries the DBA, and those messages don't help the DBA with anything. They merely worry the DBA.

Regards
MauMau
From: "David Johnston" <polobo@yahoo.com>
> PITR/Failover is not generally that frequent an occurrence but I will agree
> that these events are likely common during such.
>
> Maybe PITR/Failover mode can output something in the logs to alleviate user
> angst about these frequent events?

Yes, those messages are not output frequently, because they only appear during planned or unplanned downtime. But frequency does not matter here. Put yourself in the DBA's shoes: they need to perform a maintenance or recovery operation in a hurry and hope not to encounter any trouble. They would panic if something went wrong when they expected to see no trouble. The "FATAL" messages merely make them worried.

The extra FATAL messages can also be a problem for support staff. How would you feel if a DBA asked you for help with an emergency during recovery, with a few important messages buried in lots of unnecessary FATAL ones?

Regards
MauMau
On Sat, Dec 7, 2013 at 12:27 AM, David Johnston <polobo@yahoo.com> wrote:
>> 1. FATAL: the database system is starting up
>
> How about altering the message to tone down the severity by a half-step...
>
> FATAL: (probably) not! - the database system is starting up

Well, it is fatal: the backend for that client doesn't continue. FATAL means a backend died. It is kind of vague how FATAL and PANIC differ; they both sound like the end of the world. If you read FATAL thinking it means the whole service is quitting -- i.e., what PANIC means -- then these messages sound like noise.

--
greg
MauMau wrote
> #5 is output when the DBA shuts down the replication standby server.
> #6 is output when the DBA shuts down the server if he is using any custom
> background worker.
> These messages are always output. What I'm seeing as a problem is that
> FATAL messages are output in a normal situation, which worries the DBA,
> and those messages don't help the DBA with anything. They merely worry
> the DBA.

While "worry" is something to be avoided, not logging the messages is not the correct solution. On the project side, looking for ways and places to better communicate with the user is worthwhile in the absence of adding additional complexity to the logging system. The user/DBA, though, is expected to learn how the process works, ideally in a testing environment where worry is much less of a problem, and to understand that some of what they are seeing are client-oriented messages that they should be aware of but that do not indicate server-level issues by themselves. They should probably run the log through a filter (a template of which the project should provide) that takes the raw log and filters out those things that, relative to the role and context, are really better classified as NOTICE even if the logged level is FATAL.

Raw logs should be verbose so that tools have more data with which to make decisions. You can always hide what is provided, but you can never show what has been omitted. I get that people will scan raw logs, but targeting that lowest denominator isn't necessarily the best thing to do. Instead, getting those people on board with better practices and showing (and providing) them helpful tools should be the goal.

David J.
On 2013-12-07 13:58:11 +0000, Greg Stark wrote:
> FATAL means a backend died. It is kind of vague how FATAL and PANIC
> differ.

I don't really see much vagueness there. FATAL is an unexpected but orderly shutdown. PANIC is for the situations where we can't handle the problem that occurred in any orderly way.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 7, 2013 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't really see much vagueness there. FATAL is an unexpected but
> orderly shutdown. PANIC is for the situations where we can't handle the
> problem that occurred in any orderly way.

Sorry, I was unclear. I meant that without context, if someone asked you which is more severe, "fatal" or "panic", you would have no particular way to know. After all, for a person it's easier to cure a panic than a fatality. :)

On the client end, FATAL is pretty logical, but in the logs it makes it sound like the entire server died, especially in this day of multithreaded servers. I was suggesting that this is the origin of the confusion here. Anyone who has seen these messages on the client end many times might interpret them correctly in the server logs, but someone who has only been a DBA, not a database user, might never have seen them except in the server logs, and without that context might not realize that FATAL is a term of art peculiar to Postgres.

--
greg
From: "David Johnston" <polobo@yahoo.com>
> While "worry" is something to be avoided not logging messages is not the
> correct solution. [...] The user/DBA, though, is expected to
> learn how the process works, ideally in a testing environment where worry
> is much less a problem, and understand that some of what they are seeing
> are client-oriented messages that they should be aware of but that do not
> indicate server-level issues by themselves.

I see. It might make sense to make the DBA learn how the system works, so that more DBAs can solve problems by themselves. However, I wonder how those messages (just stopping a server process by admin request) are useful for the DBA. If they are useful, why don't other background processes (e.g. archiver, writer, walsender, etc.) output such a message when stopping? For information, #5 and #6 are not client-oriented.

If we want to keep the messages, I think LOG or DEBUG would be appropriate. How about altering "ereport(FATAL, ...)" to "ereport(LOG, ...); proc_exit(1)"?

Regards
MauMau
From: "Greg Stark" <stark@mit.edu>
> On the client end the FATAL is pretty logical but in the logs it makes
> it sound like the entire server died. Especially in this day of
> multithreaded servers. I was suggesting that that was the origin of
> the confusion here. Anyone who has seen these messages on the client
> end many times might interpret them correctly in the server logs but
> someone who has only been a DBA, not a database user might never have
> seen them except in the server logs and without the context might not
> realize that FATAL is a term of art peculiar to Postgres.

I think so, too. My customers and colleagues, who are relatively new to PostgreSQL, have asked things like "Is this FATAL message a sign of some problem? What does it mean?" I think it's natural to show concern on finding FATAL messages. I find it unnatural for a normal administration operation to emit a FATAL message.

Regards
MauMau
MauMau <maumau307@gmail.com> wrote: > From: "Greg Stark" <stark@mit.edu> >> On the client end the FATAL is pretty logical but in the logs it >> makes it sound like the entire server died. I agree that is easily misunderstood, especially since a FATAL problem is less severe than a PANIC, while in common English usage panic is what one might feel when faced with the prospect of speaking in public, but fatal generally means something that kills -- like a disease or a plane crash. There is the notion of a "fatal error" in English, though, which means an error which puts an end to what the person who makes such an error is attempting. >> FATAL is a term of art peculiar to Postgres. No, it is not; at least not in terms of being "characteristic of only one person, group, or thing". The Java logger from Apache called log4j has a FATAL level which is more serious than ERROR. The distinction is intended to indicate whether the application is likely to be able to continue running. Similarly, Sybase ASE has severity levels, where above a certain point they are described as "fatal" -- meaning that the application must acquire a new connection. It's probably used elsewhere as well; those are just a couple I happened to be familiar with. > I find it unnatural for a normal administration operation to emit > a FATAL message. It seems to be a fairly common term of art for a problem which requires a restart or reconnection. FATAL is used when the problem is severe enough that the process or connection must end. It seems to me to be what should consistently be used when a client connection or its process must be terminated for a reason other than a client-side request to terminate. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/8/13 3:08 PM, MauMau wrote: >>> >>> #5 is output when the DBA shuts down the replication standby server. >>> #6 is output when the DBA shuts down the server if he is using any custom >>> background worker. >>> These messages are always output. What I'm seeing as a problem is that >>> FATAL messages are output in a normal situation, which worries the DBA, >>> and >>> those messages don't help the DBA with anything. They merely worry the >>> DBA. >> >> While "worry" is something to be avoided, not logging messages is not the >> correct solution. On the project side looking for ways and places to better >> communicate to the user is worthwhile in the absence of adding additional >> complexity to the logging system. The user/DBA, though, is expected to >> learn how the process works, ideally in a testing environment where worry is >> much less of a problem, and understand that some of what they are seeing are >> client-oriented messages that they should be aware of but that do not >> indicate server-level issues by themselves. > > I see. It might make sense to make the DBA learn how the system works, so that more DBAs can solve problems by themselves. However, I wonder how those messages (just stopping a server process at the admin's request) are useful to the DBA. If they are useful, why don't other background processes (e.g. archiver, writer, walsender, etc.) output such a message when stopping? For information, #5 and #6 are not client-oriented. > > If we want to keep the messages, I think LOG or DEBUG would be appropriate. How about altering "ereport(FATAL, ...)" to "ereport(LOG, ...); proc_exit(1)"? IMNSHO, this isn't a "worry" thing, it's a complete mis-categorization of what's happening. Do we output "FATAL" when you disconnect from the database? Something that is happening *by design* should not be considered "fatal".
I'm not saying we shouldn't log it at all, but it's silly to call it fatal just because the process is going away *when we did something that needs to make it go away*. What makes it worse is what happens if you do something like pg_terminate_backend() on one of those? That *should* create a fatal error (or at least be treated the same way we treat any other use of pg_terminate_backend()). So in this case I'm betting the real problem is that an orderly shutdown is calling pg_terminate_backend() or the equivalent. While that's convenient from a code standpoint, it's inaccurate from an API standpoint. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 12/6/13 7:38 AM, Andres Freund wrote: > On 2013-12-06 22:35:21 +0900, MauMau wrote: >> From: "Tom Lane" <tgl@sss.pgh.pa.us> >>> No. They are FATAL so far as the individual session is concerned. >>> Possibly some documentation effort is needed here, but I don't think >>> any change in the code behavior would be an improvement. >> >> You are suggesting that we should add a note like "Don't worry about the >> following message. This is a result of normal connectivity checking", aren't >> you? >> >> FATAL: the database system is starting up > > Uh. An explanation why you cannot connect to the database hardly seems > like a superfluous log message. It is when *you* are not actually trying to connect but rather pg_ctl is (which is one of the use cases here). Arguably 1-3 are inaccurate since they're not really about a backend dying... they occur during the startup phase; you never even get a functioning backend. That's a marked difference from other uses of FATAL. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <jim@nasby.net> writes: > Arguably 1-3 are inaccurate since they're not really about a backend dying... they occur during the startup phase; you never even get a functioning backend. That's a marked difference from other uses of FATAL. How so? "FATAL" means "an error that terminates your session", which is exactly what these are. regards, tom lane
On 12/9/13 5:56 PM, Tom Lane wrote: > Jim Nasby <jim@nasby.net> writes: >> Arguably 1-3 are inaccurate since they're not really about a backend dying... they occur during the startup phase; you never even get a functioning backend. That's a marked difference from other uses of FATAL. > > How so? "FATAL" means "an error that terminates your session", which > is exactly what these are. Except in these cases the user never actually got a working session; their request was denied. To be clear, from the client standpoint it's certainly fatal, but not from the server's point of view. This is fully expected behavior as far as the server is concerned. (Obviously it might be an error that caused the shutdown/recovery, but that's something different.) -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <jim@nasby.net> writes: > On 12/9/13 5:56 PM, Tom Lane wrote: >> How so? "FATAL" means "an error that terminates your session", which >> is exactly what these are. > Except in these cases the user never actually got a working session; their request was denied. > To be clear, from the client standpoint it's certainly fatal, but not from the server's point of view. This is fully expected behavior as far as the server is concerned. (Obviously it might be an error that caused the shutdown/recovery, but that's something different.) Right, but as already pointed out in this thread, these messages are worded from the client's point of view. "The client never got a working connection" seems to me to be an empty distinction. If you got SIGTERM'd before you could issue your first query, should that not be FATAL because you'd not gotten any work done? More generally, we also say FATAL for all sorts of entirely routine connection failures, like wrong password or mistyped user name. People don't seem to have a problem with those. Even if some do complain, the costs of changing that behavior after fifteen-years-and-counting would certainly exceed any benefit. regards, tom lane
On 12/05/2013 11:25 PM, MauMau wrote: > Hello, > > My customers and colleagues sometimes (or often?) ask about the > following message: > > FATAL: the database system is starting up I would LOVE that message to go away, forever. It's a huge PITA for automated log monitoring, analysis, and alerting. The other one I'd personally like to change, but realise is harder to actually do, is to separate "ERROR"s caused by obvious user input issues from internal ERRORs like not finding the backing file for a relation, block read errors, etc. String pattern matching is a crude and awful non-solution, especially given the way PostgreSQL loves to output messages to the log in whatever language and encoding the current database connection is in. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
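Craig's point about pattern matching can be made concrete with a small sketch. The matcher below is a hypothetical example of the naive kind he is criticizing: it only knows the English severity tags, so a localized log line (German builds, for instance, render ERROR as FEHLER) slips straight past it and the monitoring misses a real problem:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Naive log-line classifier of the kind monitoring scripts often use.
 * It matches only the English severity tags, so any localized log line
 * is silently misclassified as harmless. */
static bool
looks_like_error(const char *logline)
{
    return strstr(logline, "ERROR:") != NULL
        || strstr(logline, "FATAL:") != NULL
        || strstr(logline, "PANIC:") != NULL;
}
```

This is why a machine-readable severity or criticality field, rather than string matching on translated text, keeps coming up in this thread.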
On 12/06/2013 03:02 AM, Josh Berkus wrote: > Heck, I'd be happy just to have a class of messages which specifically > means "OMG, there's something wrong with the server", that is, a flag > for messages which only occur when PostgreSQL encounters a bug, data > corruption, or platform error. Right now, I have to suss those out by > regex. +10 That's what I really need to see in the logs, too. Using SQLState might be reasonable, but I'd be concerned we'd find cases where the same standard-specified SQLState should be sent to the *client*, despite different underlying causes on the server with different levels of DBA concern. I'd rather not suppress logging of user-level errors, etc; they're important too. It's just necessary to be able to separate them from internal errors that should never happen when the system is operating correctly, in order to do effective log-based alerting. After all, I *don't* want to get an SMS whenever the deadlock detector kicks in and someone's created a database in de_DE so the string pattern match doesn't kick in. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 10, 2013 at 08:47:22AM +0800, Craig Ringer wrote: > On 12/05/2013 11:25 PM, MauMau wrote: > > Hello, > > > > My customers and colleagues sometimes (or often?) ask about the > > following message: > > > > FATAL: the database system is starting up > > I would LOVE that message to go away, forever. > > It's a huge PITA for automated log monitoring, analysis, and alerting. > > The other one I'd personally like to change, but realise is harder to > actually do, is to separate "ERROR"s caused by obvious user input issues > from internal ERRORs like not finding the backing file for a relation, > block read errors, etc. > > String pattern matching is a crude and awful non-solution, especially > given the way PostgreSQL loves to output messages to the log in whatever > language and encoding the current database connection is in. Yes, this is certainly a challenge. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
From: "Kevin Grittner" <kgrittn@ymail.com> It seems to be a fairly common term of art for a problem which requires a restart or reconnection. FATAL is used when the problem is severe enough that the process or connection must end. It seems to me to be what should consistently be used when a client connection or its process must be terminated for a reason other than a client-side request to terminate. What do you think of #5 and #6 when matching the above criteria? 5. FATAL: terminating walreceiver process due to administrator command 6. FATAL: terminating background worker \"%s\" due to administrator command These are output when the DBA shuts down the database server and there's no client connection. That is, these don't meet the criteria. I believe these should be suppressed, or use LOG instead of FATAL. Regards MauMau
On 2013-12-12 00:31:25 +0900, MauMau wrote: > What do you think of #5 and #6 when matching the above criteria? > > 5. FATAL: terminating walreceiver process due to administrator command > 6. FATAL: terminating background worker \"%s\" due to administrator command Those are important if they happen outside a shutdown. So, if you really want to remove them from there, you'd need to change the signalling to handle the cases differently. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
MauMau <maumau307@gmail.com> wrote: > From: "Kevin Grittner" <kgrittn@ymail.com> > FATAL is used when the problem is severe enough that the process > or connection must end. It seems to me to be what should > consistently be used when a client connection or its process must > be terminated for a reason other than a client-side request to > terminate. > > What do you think of #5 and #6 when matching the above criteria? > > 5. FATAL: terminating walreceiver process due to administrator > command > 6. FATAL: terminating background worker \"%s\" due to > administrator command Those are client connections and their backends terminated for a reason other than the client side of the connection requesting it. If we don't classify those as FATAL then the definition of FATAL becomes much more fuzzy. What would you define it to mean? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Jim Nasby <jim@nasby.net> writes: >> On 12/9/13 5:56 PM, Tom Lane wrote: >>> How so? "FATAL" means "an error that terminates your session", which >>> is exactly what these are. > >> Except in these cases the user never actually got a working session; >> their request was denied. > >> To be clear, from the client standpoint it's certainly fatal, but not >> from the server's point of view. This is fully expected behavior as far >> as the server is concerned. (Obviously it might be an error that caused >> the shutdown/recovery, but that's something different.) > > Right, but as already pointed out in this thread, these messages are > worded from the client's point of view. "The client never got a working > connection" seems to me to be an empty distinction. If you got SIGTERM'd > before you could issue your first query, should that not be FATAL because > you'd not gotten any work done? > > More generally, we also say FATAL for all sorts of entirely routine > connection failures, like wrong password or mistyped user name. People > don't seem to have a problem with those. Even if some do complain, > the costs of changing that behavior after fifteen-years-and-counting > would certainly exceed any benefit. I agree that #1-#3 are of course reasonable when there's an actual client the user runs. The problem is that #1 (The database system is starting up) is output in the server log by pg_ctl. In that case, there's no client the user is responsible for. Why does a new DBA have to be worried about that FATAL message? He didn't do anything wrong. I thought adding "options='-c log_min_messages=PANIC'" to the connection string for PQping() in pg_ctl.c would make the message vanish, but it didn't. The reason is that connection options take effect in PostgresMain(), which is after the FATAL condition is checked in ProcessStartupPacket(). Do you think there is any good solution? Regards MauMau
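The ordering problem MauMau describes can be sketched with a standalone toy model. This is not PostgreSQL source; the function names only echo the real flow in ProcessStartupPacket() and PostgresMain(). The point is that the startup FATAL is emitted while log_min_messages still holds its server-wide value, and the per-connection "options" from the startup packet are applied only later, for connections that get past that check:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative severity values; not the real constants from elog.h. */
enum elevel { LEVEL_WARNING = 19, LEVEL_FATAL = 22, LEVEL_PANIC = 23 };

static enum elevel log_min_messages = LEVEL_WARNING;  /* server-wide GUC */

/* Phase 1: startup-packet processing.  If the server is still starting
 * up, the "database system is starting up" FATAL is reported here --
 * and logged, because log_min_messages still has its server-wide
 * value at this point. */
static bool
startup_check_logs_fatal(bool still_starting_up)
{
    return still_starting_up && LEVEL_FATAL >= log_min_messages;
}

/* Phase 2: reached only by accepted connections.  The per-connection
 * options from the startup packet (e.g. -c log_min_messages=PANIC)
 * are applied here -- too late to suppress anything from phase 1. */
static void
apply_startup_packet_options(enum elevel per_conn_level)
{
    log_min_messages = per_conn_level;
}
```

In other words, the rejected ping never reaches the phase where its requested setting could take effect.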
"MauMau" <maumau307@gmail.com> writes: > I agree that #1-#3 are of course reasonable when there's any client the user > runs. The problem is that #1 (The database system is starting up) is output > in the server log by pg_ctl. In that case, there's no client the user is > responsible for. Why does a new DBA have to be worried about that FATAL > message? He didn't do anything wrong. FATAL doesn't mean "the DBA did something wrong". It means "we terminated a client session". The fundamental problem IMO is that you want to complicate the definition of what these things mean as a substitute for DBAs learning something about Postgres. That seems like a fool's errand from here. They're going to have to learn what FATAL means sooner or later, and making it more complicated just raises the height of that barrier. regards, tom lane
From: "Andres Freund" <andres@2ndquadrant.com> > On 2013-12-12 00:31:25 +0900, MauMau wrote: >> 5. FATAL: terminating walreceiver process due to administrator command >> 6. FATAL: terminating background worker \"%s\" due to administrator >> command > > Those are important if they happen outside a shutdown. So, if you really > want to remove them from there, you'd need to change the signalling to > handle the cases differently. How are they important? If someone mistakenly sends SIGTERM to walreceiver and background workers, they are automatically launched by postmaster or startup process later like other background processes. But other background processes such as walsender, bgwriter, etc. don't emit FATAL messages. Regards MauMau
From: "Kevin Grittner" <kgrittn@ymail.com> > 5. FATAL: terminating walreceiver process due to administrator > command > 6. FATAL: terminating background worker \"%s\" due to > administrator command Those are client connections and their backends terminated for a reason other than the client side of the connection requesting it. If we don't classify those as FATAL then the definition of FATAL becomes much more fuzzy. What would you define it to mean? I'm sorry to cause you trouble, but my understanding is that those are not client connections. They are just background server processes; walreceiver is started by startup process, and background workers are started by extension modules. Am I misunderstanding something? According to Table 18-1 in the manual: http://www.postgresql.org/docs/current/static/runtime-config-logging.html the definition of FATAL is: FATAL Reports an error that caused the current session to abort. This does not apply to the above messages, because there is no error. The DBA just shut down the database server, and the background processes terminated successfully. If some message output is desired, LOG's definition seems the nearest: LOG Reports information of interest to administrators, e.g., checkpoint activity. So, I thought "ereport(LOG, ...); proc_exit(0)" is more appropriate than ereport(FATAL, ...). Is this so strange? Regards MauMau
On 12/11/2013 08:48 AM, Tom Lane wrote: > The fundamental problem IMO is that you want to complicate the definition > of what these things mean as a substitute for DBAs learning something > about Postgres. That seems like a fool's errand from here. They're going > to have to learn what FATAL means sooner or later, and making it more > complicated just raises the height of that barrier. I don't think it works to change the NOTICE/ERROR/FATAL tags; for one thing, I can hear the screaming about people's log scripts from here. However, it would really be useful to have an extra tag (in addition to the ERROR or FATAL) for "If you're seeing this message, something has gone seriously wrong on the server." Just stuff like corruption messages, backend crashes, etc. Otherwise we're requiring users to come up with an alphabet soup of regexes to filter out the noise error messages from the really, really important ones. Speaking as someone who does trainings for new DBAs, the part where I do "what to look for in the logs" requires over an hour and still doesn't cover everything. And doesn't internationalize. That's nasty. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > However, it would really be useful to have an extra tag (in addition to > the ERROR or FATAL) for "If you're seeing this message, something has > gone seriously wrong on the server." Just stuff like corruption > messages, backend crashes, etc. Right, we've discussed that idea elsewhere; there's a basically orthogonal classification that needs to happen. Pretty much all PANICs are high priority from a DBA's perspective, but only a subset of either FATAL or ERROR are. Somebody needs to do the legwork to determine just what kind of classification scheme we want and propose at least an initial set of ereports to be so marked. One thought I had was that we could probably consider the default behavior (in the absence of any call of an explicit criticality-marking function) to be like this: for ereport(), it's critical if a PANIC and otherwise not for elog(), it's critical if >= ERROR level, otherwise not. The rationale for this is that we generally use elog for not-supposed-to-happen cases, so those are probably interesting. If we start getting complaints about some elog not being so interesting, we can convert it to an ereport so it can include an explicit marking call. regards, tom lane
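Tom's proposed default can be stated as a tiny predicate. The sketch below is standalone C with made-up numeric levels rather than the real elog.h constants; it encodes only the rule he describes: an ereport() is critical by default only at PANIC, while an elog() is critical at ERROR and above, because elog is conventionally used for not-supposed-to-happen cases:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative severity ordering; the values are not the real elog.h ones. */
enum elevel { LEVEL_LOG = 15, LEVEL_ERROR = 21,
              LEVEL_FATAL = 22, LEVEL_PANIC = 23 };

/* Default "is this interesting to a DBA?" classification, absent an
 * explicit criticality-marking call at the report site. */
static bool
is_critical_by_default(bool is_ereport, enum elevel level)
{
    if (is_ereport)
        return level >= LEVEL_PANIC;   /* ereport: critical only at PANIC */
    return level >= LEVEL_ERROR;       /* elog: critical from ERROR up */
}
```

Under this default, the contested "FATAL: the database system is starting up" ereport would not be flagged critical, while an elog(ERROR) in a not-supposed-to-happen path would be, matching the intuition driving this thread.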