Re: RFC: pg_stat_logmsg - Mailing list pgsql-hackers

From Joe Conway
Subject Re: RFC: pg_stat_logmsg
Date
Msg-id 3b1683bc-2630-a0d0-6083-8a45aa1b54bf@joeconway.com
Whole thread Raw
In response to Re: RFC: pg_stat_logmsg  (Gurjeet Singh <gurjeet@singh.im>)
Responses Re: RFC: pg_stat_logmsg
List pgsql-hackers
On 7/7/23 01:38, Gurjeet Singh wrote:
> On Thu, Jul 6, 2023 at 12:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote:
>> >
>> > The basic idea is to mirror how pg_stat_statements works, except the
>> > logged messages keyed by filename, lineno, and elevel are saved with a
>> > aggregate count. The format string is displayed (similar to a query
>> > jumble) for context, along with function name and sqlerrcode.
>> >
>> >
>> > Part of the thinking is that people with fleets of postgres instances
>> > can use this to scan for various errors that they care about.
>> > Additionally it would be useful to look for "should not happen" errors.
> 
>> I'm concerned that displaying the format string could not be helpful
>> in some cases. For example, when raising an ERROR while reading WAL
>> records, we typically write the error message stored in
>> record->errormsg_buf:
>>
>> in ReadRecord():
>>             if (errormsg)
>>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>>                         (errmsg_internal("%s", errormsg) /* already
>> translated */ ));
>>
>> In this case, the error message stored in pg_stat_logmsg is just '%s'.
>> The filename and line number columns might help identify the actual
>> error but it requires users to read the source code and cannot know
>> the actual error message.
> 
> I believe that the number of such error messages, the ones with very
> little descriptive content, is very low in Postgres code. These kinds
> of messages are not prevalent, and must be used when code hits obscure
> conditions, like seen in your example above. These are the kinds of
> errors that Joe is referring to as "should not happen". In these
> cases, even if the error message was descriptive, the user will very
> likely have to dive deep into code to find out the real cause.

Agreed. As an example, after running `make installcheck`

8<-----------------
select sum(count) from pg_stat_logmsg;
  sum
------
  6005
(1 row)

select message,sum(count)
from pg_stat_logmsg
where length(message) < 30
   and elevel in ('ERROR','FATAL','PANIC')
   and message like '%\%s%' escape '\'
group by message
order by length(message);
             message            | sum
-------------------------------+-----
  %s                            | 107
  "%s" is a view                |   9
  "%s" is a table               |   4
  %s is a procedure             |   1
  invalid size: "%s"            |  13
  %s at or near "%s"            | 131
  %s at end of input            |   3
...
8<-----------------

I only see three message formats there that are generic enough to be of 
concern (the first and last two shown -- FWIW I did not see any more of 
them as the fmt string gets longer)

So out of 6005 log events, 241 fit this concern.

Perhaps given the small number of message format strings affected, it 
would make sense to special case those, but I am not sure it is worth 
the effort, at least for version 1.

> I feel that the unique combination of file name and line number is
> useful information, even in cases where the format string not very
> descriptive. So I believe the extension's behaviour in this regard is
> acceptable.
> 
> In cases where the extension's output is not descriptive enough, the
> user can use the filename:lineno pair to look for fully formed error
> messages in the actual log files; they may have to make appropriate
> changes to log_* parameters, though.

Right

> If we still feel strongly about getting the actual message for these
> cases, perhaps we can develop a heuristic to identify such messages,
> and use either full or a prefix of the 'message' field, instead of
> 'message_id' field. The heuristic could be: strlen(message_id) is too
> short, or that message_id is all/overwhelmingly format specifiers
> (e.g. '%s: %d').

Based on the above analysis (though granted, not all inclusive), it 
seems like just special casing the specific message format strings of 
interest would work.

> The core benefit of this extension is to make it easy for the user to
> discover error messages affecting their workload. The user may be
> interested in knowing the most common log messages in their server
> logs, so that they can work on reducing those errors or warnings. Or
> they may be interested in knowing when their workload hits
> unexpected/serious error messages, even if they're infrequent. The
> data exposed by pg_stat_logmsg view would serve as a starting point,
> but I'm guessing it would not be sufficient for their investigation.

Yes, exactly.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




pgsql-hackers by date:

Previous
From: Anatoly Zaretsky
Date:
Subject: Re: [PATCH] Remove unnecessary unbind in LDAP search+bind mode
Next
From: reid.thompson@crunchydata.com
Date:
Subject: Re: DecodeInterval fixes