Re: UTC4115FATAL: the database system is in recovery mode - Mailing list pgsql-general

From Mathew Samuel
Subject Re: UTC4115FATAL: the database system is in recovery mode
Date
Msg-id FBB3A126CE548B48A5CD4C867A33258603695E61DA@sottexch7.corp.ad.entrust.com
Whole thread Raw
In response to Re: UTC4115FATAL: the database system is in recovery mode  (Craig Ringer <craig@postnewspapers.com.au>)
Responses Re: UTC4115FATAL: the database system is in recovery mode  (Craig Ringer <craig@postnewspapers.com.au>)
List pgsql-general
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> On 05/30/2011 10:29 PM, Mathew Samuel wrote:
>
>> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too
>> frequently (10 seconds apart)
>> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
>> timeout
>> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze
>> _zamboni.sl_log_1
>> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
>> it was already committed
>> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was
>> terminated by signal 6
>
> Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by statement_timeout, couldn't be aborted
(assumingit was in fact 
> 1827110275) and then the backend crashed with a signal 6 (SIGABRT).
> SIGABRT can be caused by an assertion failure, certain fatal aborts in the C library caused by memory allocation
errors,etc. 
>
> Alas, while PostgreSQL may have dumped a core file I doubt there's any debug information in your build. If you do
finda core file for that process ID, it might be worth checking for a debuginfo rpm just in case. 
>
>> In fact those last 3 lines are repeated over and over again repeatedly
>> until "UTC4115FATAL: the database system is in recovery mode" is
>> logged for 4 hours. At some point, 4 hours later of course, it appears
>> that the system recovers.
>
> Wow. Four hours recovery with default checkpoint settings.
>
> Is it possible that the server was completely overloaded and was swapping heavily? That could explain why VACUUM
timedout in the first place, and would explain why it took such a long time to recover. Check your system logs around
thesame time for other indications of excessive load, and check your monitoring history if you have monitoring like
Cactior the like active. 
>
> See if there's anything interesting in the kernel logs too.
>
> Just for completeness, can you send all non-commented-out, non-blank lines in your postgresql.conf ?
>
> $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1
>
> --
> Craig Ringer

Thanks for your help, here are the results from running that you provided to me:
$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

listen_addresses='localhost'
port=5432
max_connections=200
ssl=off
shared_buffers = 24MB
max_fsm_pages = 153600
log_line_prefix='%t%p'
statement_timeout=60000
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

And as you anticipated, no core file was dumped on that system.

It is quite possible that the system was under heavy load at the time as it is a heavily used system that would be
proneto periods of strong use. No monitoring (Cacti or otherwise) was active as far as I can gather. 

Not that this is any sort of solution but would one approach be to increase the statement_timeout to greater than 1
minute?I realize that this may hide the symptom but if the customer does hit this issue again I'm just seeing if there
isan option I can change to help prevent a reoccurrence. 

Cheers,
Matt

pgsql-general by date:

Previous
From: "David Johnston"
Date:
Subject: Re: Function Column Expansion Causes Inserts To Fail
Next
From: jlhgis
Date:
Subject: troubles with initdb