Re: UTC4115FATAL: the database system is in recovery mode - Mailing list pgsql-general

From Craig Ringer
Subject Re: UTC4115FATAL: the database system is in recovery mode
Date
Msg-id 4DE42D90.6070201@postnewspapers.com.au
In response to UTC4115FATAL: the database system is in recovery mode  (Mathew Samuel <Mathew.Samuel@entrust.com>)
Responses Re: UTC4115FATAL: the database system is in recovery mode  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: UTC4115FATAL: the database system is in recovery mode  (Mathew Samuel <Mathew.Samuel@entrust.com>)
List pgsql-general
On 05/30/2011 10:29 PM, Mathew Samuel wrote:

> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
> it was already committed
> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated
> by signal 6

Interesting. It almost looks like the VACUUM FULL ANALYZE was cancelled by
statement_timeout, the abort then failed (assuming transaction 1827110275
was in fact the VACUUM's), and the backend crashed with signal 6 (SIGABRT).
SIGABRT can be caused by an assertion failure, by certain fatal aborts in
the C library triggered by memory allocation errors, etc.
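
If you end up re-running that sort of maintenance by hand, you can keep
statement_timeout out of the picture by disabling it for that session only.
A minimal sketch, assuming you connect with psql to whichever database the
_zamboni schema lives in (SET is session-scoped, so nothing else is affected):

yourdb=# SHOW statement_timeout;            -- see what it is currently set to
yourdb=# SET statement_timeout = 0;         -- 0 disables the timeout for this session
yourdb=# VACUUM FULL ANALYZE _zamboni.sl_log_1;

That only works around the cancellation, of course; it doesn't explain the
PANIC that followed it.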

Alas, while PostgreSQL may have dumped a core file, I doubt there's any
debug information in your build. If you do find a core file for that
process ID, it might be worth checking for a debuginfo rpm just in case.
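
In case it helps, here's roughly what that would look like. The package
name and core file location are guesses for a Red Hat style system (the
postmaster's working directory is normally the data directory, and the core
is usually named after the PID), so adjust to taste:

$ yum install postgresql-debuginfo          # or whichever debuginfo package matches your postgres rpm
$ gdb /usr/bin/postgres /var/lib/pgsql/data/core.3932
(gdb) bt                                    # backtrace of the crashed backend

Even a backtrace without full debug symbols is sometimes enough to narrow
things down.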

> In fact those last 3 lines are repeated over and over again repeatedly
> until "UTC4115FATAL: the database system is in recovery mode" is logged
> for 4 hours. At some point, 4 hours later of course, it appears that the
> system recovers.

Wow. Four hours of recovery with default checkpoint settings.
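
For what it's worth, acting on that log hint is a one-line change to
postgresql.conf plus a reload. The value below is purely illustrative (the
default was 3 in releases of that era), not a tuned recommendation for your
workload:

checkpoint_segments = 16

$ pg_ctl reload -D /path/to/your/data/directory

Larger values mean fewer, bigger checkpoints, at the price of more pg_xlog
space and potentially longer crash recovery.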

Is it possible that the server was completely overloaded and swapping
heavily? That would explain why the VACUUM timed out in the first place,
and why it took so long to recover. Check your system logs around that
time for other signs of excessive load, and check your monitoring history
if you have something like Cacti running.

See if there's anything interesting in the kernel logs too.
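
A couple of quick checks along those lines, assuming a Linux box with
sysstat collecting data (the sa file names and log paths are
distro-dependent, and the daily files only go back a month or so):

$ sar -r -f /var/log/sa/saDD                # memory and swap usage for day DD
$ sar -W -f /var/log/sa/saDD                # pages swapped in/out per second
$ grep -i 'out of memory' /var/log/messages
$ dmesg | grep -i oom

Anything from the OOM killer in particular would be worth ruling out.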

Just for completeness, can you send all non-commented-out, non-blank
lines in your postgresql.conf?

$ egrep '^[^#[:space:]]' postgresql.conf | cut -d '#' -f 1

--
Craig Ringer
