Re: Re: Industrial-Strength Logging - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Re: Industrial-Strength Logging
Date
Msg-id 4426.960050218@sss.pgh.pa.us
In response to Re: Industrial-Strength Logging  (Giles Lean <giles@nemeton.com.au>)
Responses Re: Re: Industrial-Strength Logging
List pgsql-hackers
Giles Lean <giles@nemeton.com.au> writes:
>> Yeah, let's have another logging discussion... :)
>
> [ good summary of different approaches: ]
>     (a)(i)  standard error to file
>        (ii) standard error piped to a process
>     (b) named log file(s)
>     (c) syslogd
>     (d) database
> I would recommend (a)(ii), with (a)(i) available for anyone who wants
> it.  (Someone who has high load 9-5 but who can shut down daily might
> be happy writing directly to a log file, for example.)

You mentioned the issue of trying to deal with out-of-disk-space errors
for the log file, but there is another kind of resource exhaustion
problem that should also be taken into account.  Namely, inability to
open the log file due to ENFILE (no kernel filetable slots left) errors.
This is fresh in my mind because I just finished making some fixes to
make Postgres more robust in the full-filetable scenario.  It's quite
easy for a Postgres installation to run the kernel out of filetable
slots if the admin has set a large MaxBackends limit without increasing
the kernel's NFILE parameter enough to cope.  So this isn't a very
farfetched scenario, and we ought to take care that our logging
mechanism doesn't break down when it happens.

You mentioned that case (b) has a popular variant of opening and closing
the logfile for each message.  I think this would be the most prone to
ENFILE failures, since the backends wouldn't normally be holding the
logfile open.  In the other cases the logfile or log pipe is held open
continually by each backend so there's no risk at that point.  Of
course, the downstream logging daemon in cases (a)(ii) and (c) might
suffer ENFILE at the time that it's trying to rotate to a new logfile.
I doubt we can expect that syslogd has a good strategy for coping with
this :-(.  If the daemon is of our own making, the first thought that
comes to mind is to hold the previous logfile open until after we
successfully open the new one.  If we get a failure on opening the new
file, we just keep logging into the old one, while periodically trying
to rotate again.
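
A minimal sketch of that rotation strategy, written in Python for brevity (a real log daemon would more likely be in C). The class name `RotatingLog`, the injectable `opener` parameter, and the choice to retry on every write rather than on a timer are all assumptions made for illustration, not part of any existing package:

```python
import errno
import os


class RotatingLog:
    """Hypothetical log sink that keeps its old file until a new one is open."""

    def __init__(self, path, opener=os.open):
        self._opener = opener          # injectable so failures can be simulated
        self._flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        self.path = path
        self.fd = opener(path, self._flags, 0o600)
        self.rotation_pending = None   # path we still want to rotate to, if any

    def rotate(self, new_path):
        """Try to switch to new_path; return True on success."""
        try:
            new_fd = self._opener(new_path, self._flags, 0o600)
        except OSError as e:
            if e.errno in (errno.ENFILE, errno.EMFILE):
                # File table is full: keep the old file and retry later.
                self.rotation_pending = new_path
                return False
            raise
        old_fd, self.fd, self.path = self.fd, new_fd, new_path
        os.close(old_fd)               # only now is it safe to let go
        self.rotation_pending = None
        return True

    def write(self, message):
        # Simplification: retry a failed rotation on each write, not on a timer.
        if self.rotation_pending is not None:
            self.rotate(self.rotation_pending)
        os.write(self.fd, message.encode() + b"\n")
```

The point of the ordering in `rotate()` is that the old descriptor is closed only after the new open has succeeded, so a failed rotation never leaves the daemon without somewhere to write.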

The recovery strategy for individual backends faced with ENFILE failures
is to close inessential files until the open() request succeeds.  (There
are normally plenty of inessential open files, since most backend I/O
goes through VFDs managed by fd.c, and any of those that are physically
open can be closed at need.)  If we use case (b) then a backend that
finds itself unable to open a log file could try to recover that way.
However there are two problems with it: one, we might be unable to log
startup failures under ENFILE conditions (since there might well be no
open VFDs in a newly-started backend, especially if the system is in
filetable trouble), and two, there's some risk of circularity problems
if fd.c is itself trying to write a log message and has to be called
back by elog.c.
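
The close-and-retry recovery described above can be sketched roughly as follows. This is Python rather than the backend's C, and it is not how fd.c is actually written: `open_with_recovery` and the `closeable_fds` list are hypothetical stand-ins for the VFD machinery, assumed here only to show the shape of the loop.

```python
import errno
import os


def open_with_recovery(path, flags, closeable_fds, opener=os.open):
    """Open path, closing one entry of closeable_fds per ENFILE/EMFILE failure.

    closeable_fds is a hypothetical list of descriptors that are safe to close
    and reopen later (a stand-in for physically open VFDs).
    """
    while True:
        try:
            return opener(path, flags, 0o600)
        except OSError as e:
            if e.errno not in (errno.ENFILE, errno.EMFILE):
                raise                  # unrelated error: do not mask it
            if not closeable_fds:
                # Nothing left to release, e.g. a newly started backend.
                raise
            os.close(closeable_fds.pop())
```

The empty-list branch is the first problem noted above: a backend with no open VFDs has nothing to give back, so the open simply fails.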

Case (d), logging to a database table, would be OK in the face of ENFILE
during normal operation, but again I worry about the prospect of being
unable to log startup failures.  (Actually, there's a more serious
problem with it for startup failures: a backend cannot be expected to do
database writes until it's pretty fully up to speed.  Between that and
the fact that the postmaster can't write to tables either, I think we can
reject case (d) for our purposes.)

So from this point of view, it again seems that case (a)(i) or (a)(ii)
is the best alternative, so long as the logging daemon is coded not to
give up its handle for an old log file until it's successfully acquired
a new one.


Seems like the next step should be for someone to take a close look at
the several available log-daemon packages and see which of them looks
like the best bet for our purposes.  (I assume there's no good reason
to roll our own from scratch...)
        regards, tom lane

