Re: UTC4115FATAL: the database system is in recovery mode - Mailing list pgsql-general

From Craig Ringer
Subject Re: UTC4115FATAL: the database system is in recovery mode
Date
Msg-id 4DE42D90.6070201@postnewspapers.com.au
In response to UTC4115FATAL: the database system is in recovery mode  (Mathew Samuel <Mathew.Samuel@entrust.com>)
Responses Re: UTC4115FATAL: the database system is in recovery mode  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: UTC4115FATAL: the database system is in recovery mode  (Mathew Samuel <Mathew.Samuel@entrust.com>)
List pgsql-general
On 05/30/2011 10:29 PM, Mathew Samuel wrote:

> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
> it was already committed
> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated
> by signal 6

Interesting. It almost looks like the VACUUM FULL ANALYZE was cancelled by
statement_timeout, the abort then failed (assuming transaction 1827110275
was in fact the VACUUM's), and the backend crashed with signal 6 (SIGABRT).
SIGABRT can be caused by an assertion failure, by certain fatal aborts in
the C library triggered by memory allocation errors, etc.
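
If you end up re-running that sort of maintenance by hand, you can keep
statement_timeout out of the picture by disabling it for that session only.
A minimal sketch, assuming you connect with psql to whichever database the
_zamboni schema lives in (SET is session-scoped, so nothing else is affected):

yourdb=# SHOW statement_timeout;            -- see what it is currently set to
yourdb=# SET statement_timeout = 0;         -- 0 disables the timeout for this session
yourdb=# VACUUM FULL ANALYZE _zamboni.sl_log_1;

That only works around the cancellation, of course; it doesn't explain the
PANIC that followed it.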

Alas, while PostgreSQL may have dumped a core file, I doubt there's any
debug information in your build. If you do find a core file for that
process ID, it might be worth checking for a debuginfo rpm just in case.
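
In case it helps, here's roughly what that would look like. The package
name and core file location are guesses for a Red Hat style system (the
postmaster's working directory is normally the data directory, and the core
is usually named after the PID), so adjust to taste:

$ yum install postgresql-debuginfo          # or whichever debuginfo package matches your postgres rpm
$ gdb /usr/bin/postgres /var/lib/pgsql/data/core.3932
(gdb) bt                                    # backtrace of the crashed backend

Even a backtrace without full debug symbols is sometimes enough to narrow
things down.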

> In fact those last 3 lines are repeated over and over again repeatedly
> until "UTC4115FATAL: the database system is in recovery mode" is logged
> for 4 hours. At some point, 4 hours later of course, it appears that the
> system recovers.

Wow. Four hours of recovery with default checkpoint settings.
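
For what it's worth, acting on that log hint is a one-line change to
postgresql.conf plus a reload. The value below is purely illustrative (the
default was 3 in releases of that era), not a tuned recommendation for your
workload:

checkpoint_segments = 16

$ pg_ctl reload -D /path/to/your/data/directory

Larger values mean fewer, bigger checkpoints, at the price of more pg_xlog
space and potentially longer crash recovery.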

Is it possible that the server was completely overloaded and swapping
heavily? That would explain why the VACUUM timed out in the first place,
and why it took so long to recover. Check your system logs around that
time for other signs of excessive load, and check your monitoring history
if you have something like Cacti running.

See if there's anything interesting in the kernel logs too.
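
A couple of quick checks along those lines, assuming a Linux box with
sysstat collecting data (the sa file names and log paths are
distro-dependent, and the daily files only go back a month or so):

$ sar -r -f /var/log/sa/saDD                # memory and swap usage for day DD
$ sar -W -f /var/log/sa/saDD                # pages swapped in/out per second
$ grep -i 'out of memory' /var/log/messages
$ dmesg | grep -i oom

Anything from the OOM killer in particular would be worth ruling out.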

Just for completeness, can you send all non-commented-out, non-blank
lines in your postgresql.conf?

$ egrep '^[^#[:space:]]' postgresql.conf | cut -d '#' -f 1

--
Craig Ringer
