Thread: Log file monitoring and event notification

Log file monitoring and event notification

From
Andy Colson
Date:
Hi All.

I've started using replication, and I'd like to monitor my logs for any errors or problems.  I don't want to do it
manually, and I'm not interested in stats (a la PgBadger).

What I'd like is this: the instant PG logs "FATAL: wal segment already removed" (or some such bad thing), I get
an email.

1st: is anyone using a program that does something like this?  What do you use?  How do you like it?

My thinking has been along these lines:

  + log to syslog doesn't really help, and I recall seeing somewhere "syslog may not capture everything".  I still have
monitoring and log rotation problems.

  + log to stderr and write my own collector works, but then I have to duplicate what logging_collector already does
(rotating, truncating, age, size, etc).  Too much work.

  + log with logging_collector, then write a thing to figure out what file it's writing to and tail it, watch for
rotation, etc.  This is just messy.
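
For reference, the third option above (find the file logging_collector is writing to, tail it, watch for rotation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the file name pattern "postgresql-*.log", the class name LogFollower, and the choice of "most recently modified file is the current one" are all hypothetical and would need to match the actual log_directory and log_filename settings.

```python
import glob
import os
import re

# Severity words to alert on (an assumption; adjust to taste).
FATAL_RE = re.compile(r"\b(FATAL|PANIC)\b")


def newest_log(log_dir, pattern="postgresql-*.log"):
    """Return the most recently modified matching file, or None."""
    paths = glob.glob(os.path.join(log_dir, pattern))
    return max(paths, key=os.path.getmtime) if paths else None


class LogFollower:
    """Remembers the current file and byte offset between polls."""

    def __init__(self, log_dir, pattern="postgresql-*.log"):
        self.log_dir = log_dir
        self.pattern = pattern
        self.path = None
        self.offset = 0

    def _read_new(self, path):
        """Return matching complete lines written since the last read."""
        try:
            if os.path.getsize(path) < self.offset:
                self.offset = 0  # file was truncated; start over
            with open(path, "rb") as f:
                f.seek(self.offset)
                data = f.read()
        except FileNotFoundError:
            return []
        end = data.rfind(b"\n") + 1  # leave a half-written last line for next time
        self.offset += end
        lines = data[:end].decode("utf-8", errors="replace").splitlines()
        return [line for line in lines if FATAL_RE.search(line)]

    def poll(self):
        """Return new matching lines, following a rotation if one happened."""
        hits = []
        newest = newest_log(self.log_dir, self.pattern)
        if newest is None:
            return hits
        if self.path is not None and newest != self.path:
            hits += self._read_new(self.path)  # drain the rotated-out file
            self.offset = 0
        self.path = newest
        hits += self._read_new(newest)
        return hits
```

A caller would run poll() in a loop with a short sleep and hand any returned lines to whatever sends the email. It polls rather than reacting instantly, and it does not handle two files sharing the same modification time, which is part of why this approach feels messy.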

If there isn't a program already available (and I've searched, believe me), I'd like to get feedback on extending
logging_collector with some Lua-scriptable event notification.

Lua is small, fast, and mostly easy to embed.  It would allow an admin to customize whatever kind of monitoring they
want.  When a log line matches, logging_collector would spawn a separate app to handle the event notification.  The
app would be launched in the background and forgotten about so that logging isn't delayed.

I'm thinking:

function checkLine(item)
   if item:find('FATAL') then
      launch('/usr/bin/mynotify.pl', item)
   end
end

Logging_collector would then do something like (forgive the perl pseudo code):

... regular log file rotation stuff ..
open OUT
while ($line = <stderr>)
{
   checkLine($line);
   print OUT $line;
}

... etc, etc ...

Lua could also have other handy events defined:
    OnLogRotate(newFile)
    OnStartup()
    OnShutdown()


Lua can also keep state, so maybe you don't want to email on the first FATAL, but on the third.

local cc = 0
function checkLine(item)
   if item:find('FATAL') then
      cc = cc + 1
      if cc > 2 then
        launch('/usr/bin/mynotify.pl', item)
        cc = 0
      end
   end
end

Thoughts?

-Andy


Re: Log file monitoring and event notification

From
"Antman, Jason (CMG-Atlanta)"
Date:
General thought:

It's entirely possible my current Postgres environment is missing
something (I'm an automation engineer, not a DBA - most of my postgres
knowledge has been learned on the job or from Google), but we actively
monitor the receive and replay lag (i.e. comparing
pg_current_xlog_location() on the master to
pg_last_xlog_receive_location() and pg_last_xlog_replay_location() on
the slaves) and alert off of that. We don't use any logs for replication
alerts.
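
As a rough illustration of that comparison: the location functions return text such as "0/3000060", i.e. two hexadecimal halves separated by a slash. The sketch below turns such a value into a byte position and checks the differences against a threshold. It assumes the two halves form one contiguous 64-bit position (which I believe holds from 9.3 on; on older releases the result would only be approximate), and the function names and the 16 MB default threshold are made up for the example.

```python
def lsn_to_bytes(lsn):
    """Convert an xlog location like '0/3000060' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def lag_bytes(master_lsn, standby_lsn):
    """Bytes the standby is behind the master (negative if ahead)."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(standby_lsn)


def check_lag(master, receive, replay, max_bytes=16 * 1024 * 1024):
    """Return a list of alert strings; empty means all is well.

    master  -- pg_current_xlog_location() from the master
    receive -- pg_last_xlog_receive_location() from the slave
    replay  -- pg_last_xlog_replay_location() from the slave
    """
    alerts = []
    for name, value in (("receive", receive), ("replay", replay)):
        behind = lag_bytes(master, value)
        if behind > max_bytes:
            alerts.append("%s lag is %d bytes" % (name, behind))
    return alerts
```

The three values would come from running the corresponding SELECTs on the master and on each slave; how they are fetched (psql from a Nagios plugin, a driver, etc.) is left out here.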

We *do*, however, monitor postgres logs for other things. We use Nagios
(actually Icinga) as our monitoring system, and there's a nice Perl
plugin available online called check_logfiles
(http://exchange.nagios.org/directory/Plugins/Log-Files/check_logfiles/details)
that handles alerting on regular expressions in a log file, and also
very nicely handles file rotation (even compression), and is highly
configurable (including perl hook scripts to run if a match is found).

In the easiest case (like if you're not using a real monitoring system),
you could just configure this script, run it however you want (cron?)
and if it exits non-zero, mail the output.
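
A minimal sketch of such a wrapper, for illustration only: it runs whatever command it is given and hands the output to a caller-supplied function when the exit status is non-zero. The names run_check and report_if_failed and the send callback are hypothetical; the actual check_logfiles command line and the mail delivery are deliberately left to the caller.

```python
import subprocess


def run_check(cmd):
    """Run cmd (a list of arguments); return (exit_code, combined output)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def report_if_failed(cmd, send):
    """Call send(subject, body) only when cmd exits non-zero."""
    code, output = run_check(cmd)
    if code != 0:
        send("check exited with status %d" % code, output)
    return code
```

Nagios-style plugins conventionally exit 0 for OK and non-zero for warning, critical, or unknown, so "non-zero means mail it" covers all the interesting cases.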

In terms of embedding things in Postgres, I'm a staunch believer that
for performance and reliability, something like alerting shouldn't be
embedded in the application itself but should be handled by an external
(and easily replace-able) component. It's easy enough to do with
logging_collector, or to do with syslog (AFAIK the worry about not
capturing everything is only if you're shipping syslog over the network,
not if you're running a syslogd on the same host as postgres and writing
the logs locally).

From a systems management/monitoring standpoint, I'd much rather see
something in postgres that sends detailed, well-structured log messages
to a message queue than put the alerting logic in it (syslog works with
everything, but it's so horribly obsolete).

-Jason

--

Jason Antman | Systems Engineer | CMGdigital
jason.antman@coxinc.com | p: 678-645-4155


Re: Log file monitoring and event notification

From
bricklen
Date:

On Sat, Apr 5, 2014 at 8:47 AM, Andy Colson <andy@squeakycode.net> wrote:
> Hi All.
>
> I've started using replication, and I'd like to monitor my logs for any errors or problems.  I don't want to do it manually, and I'm not interested in stats (a la PgBadger).
>
> What I'd like, is the instant PG logs: "FATAL: wal segment already removed" (or some such bad thing), I'd like to get an email.
>
> 1st: is anyone using a program that does something like this?  What do you use?  How do you like it?

Tail 'n' Mail from Bucardo might be what you're after: http://bucardo.org/wiki/Tail_n_mail

Re: Log file monitoring and event notification

From
Steve Crawford
Date:
On 04/05/2014 08:47 AM, Andy Colson wrote:
> Hi All.
>
> I've started using replication, and I'd like to monitor my logs for
> any errors or problems.  I don't want to do it manually, and I'm not
> interested in stats (a la PgBadger).
>
> What I'd like, is the instant PG logs: "FATAL: wal segment already
> removed" (or some such bad thing), I'd like to get an email....

As one component of our monitoring we route logging through syslog which
has all messages go to one location for use by PgBadger and friends and
simultaneously any message with a WARN or higher priority goes to a
separate temporary "postgresql_trouble.log."

A cron job checks this file periodically (currently every 5 minutes)
for content. If the file has content, the script sends the appropriate
emails and truncates the trouble log.
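
The cron side of this can be sketched in a few lines. This is an illustration rather than the actual script: the function name drain_trouble_log and the send callback are hypothetical, and the mail delivery itself is left to the caller.

```python
def drain_trouble_log(path, send):
    """If the file at path has content, pass it to send() and truncate it.

    Returns True when something was sent, False otherwise.
    """
    try:
        with open(path, "r+", errors="replace") as f:
            content = f.read()
            if not content:
                return False
            send(content)
            # Empty the file in place so syslog keeps writing to the same file.
            f.seek(0)
            f.truncate()
    except FileNotFoundError:
        return False
    return True
```

One caveat with this shape: anything syslog appends between the read and the truncate is lost from the trouble log, though it still lands in the main log that receives all messages.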

Cheers,
Steve