Thread: UTC4115FATAL: the database system is in recovery mode

UTC4115FATAL: the database system is in recovery mode

From
Mathew Samuel
Date:
Hi,
 
I see the following error as found in pg.log:
UTC4115FATAL:  the database system is in recovery mode
 
Actually that message was logged repeatedly for about 4 hours according to the logs (I don't have access to the system itself, just the logs).
 
Leading up to that error were the following in pg.log:
2011-03-28 10:44:06 UTC3609LOG:  checkpoints are occurring too frequently (11 seconds apart)
2011-03-28 10:44:06 UTC3609HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG:  checkpoints are occurring too frequently (12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG:  checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG:  checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR:  canceling statement due to statement timeout
2011-03-28 10:44:42 UTC3932STATEMENT:  vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC:  cannot abort transaction 1827110275, it was already committed
2011-03-28 10:44:42 UTC3566LOG:  server process (PID 3932) was terminated by signal 6
2011-03-28 10:44:42 UTC3566LOG:  terminating any other active server processes
2011-03-28 10:44:42 UTC13142WARNING:  terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC13142DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING:  terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC29834DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING:  terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC3553DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

 
In fact those last 3 lines are repeated over and over until "UTC4115FATAL:  the database system is in recovery mode" is logged for about 4 hours. At some point, roughly 4 hours later, it appears that the system recovers.
 
The checkpoint settings in postgresql.conf are commented out, so I guess the defaults are being used:
#checkpoint_segments = 3                # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min              # range 30s-1h
#checkpoint_warning = 30s               # 0 is off
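
If it helps, my working assumption (untested, and the values below are only guesses) is that acting on that hint would mean uncommenting and raising those settings to something like:

checkpoint_segments = 16                # in logfile segments, 16MB each; default is 3
checkpoint_timeout = 5min               # leaving the default
checkpoint_warning = 30s                # leaving the default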

 
The system where this was seen was running PostgreSQL 8.2.6 on RHEL4.
 
I'm not sure if this is a known bug (or whether it is a bug at all, or something I can address with different configuration), but I thought I would post here first in case anyone is familiar with this issue and can suggest a possible solution. Any ideas?
 
Cheers,
Matt
 

Re: UTC4115FATAL: the database system is in recovery mode

From
Craig Ringer
Date:
On 05/30/2011 10:29 PM, Mathew Samuel wrote:

> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
> it was already committed
> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated
> by signal 6

Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by
statement_timeout, couldn't be aborted (assuming it was in fact
1827110275) and then the backend crashed with a signal 6 (SIGABRT).
SIGABRT can be caused by an assertion failure, certain fatal aborts in
the C library caused by memory allocation errors, etc.

Alas, while PostgreSQL may have dumped a core file I doubt there's any
debug information in your build. If you do find a core file for that
process ID, it might be worth checking for a debuginfo rpm just in case.

> In fact those last 3 lines are repeated over and over until
> "UTC4115FATAL: the database system is in recovery mode" is logged for
> about 4 hours. At some point, roughly 4 hours later, it appears that
> the system recovers.

Wow. Four hours recovery with default checkpoint settings.

Is it possible that the server was completely overloaded and was
swapping heavily? That could explain why VACUUM timed out in the first
place, and would explain why it took such a long time to recover. Check
your system logs around the same time for other indications of excessive
load, and check your monitoring history if you have something like
Cacti active.

See if there's anything interesting in the kernel logs too.
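
On RHEL4 something along these lines would be a reasonable first pass (only the usual suspects; adjust for whatever is actually installed and logging on that box):

$ grep -iE 'oom|out of memory' /var/log/messages
$ sar -r        # historical memory/swap usage, if sysstat has been collecting it
$ vmstat 5      # live view; swap-in/swap-out show up in the si/so columns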

Just for completeness, can you send all non-commented-out, non-blank
lines in your postgresql.conf ?

$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

--
Craig Ringer

Re: UTC4115FATAL: the database system is in recovery mode

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> On 05/30/2011 10:29 PM, Mathew Samuel wrote:
>> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
>> timeout
>> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
>> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
>> it was already committed
>> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated
>> by signal 6

> Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by
> statement_timeout, couldn't be aborted (assuming it was in fact
> 1827110275) and then the backend crashed with a signal 6 (SIGABRT).

Yeah, that seems highly likely.  There's a long-known problem in the
pre-9.0 implementation of VACUUM FULL, that it marks itself as committed
well before the vacuuming is actually done.  Any error that occurs after
that point results in exactly the above symptom.  There's a hack
solution for that in releases made after Nov 2009, and getting rid of
the problem in a cleaner fashion was one of the motivations for
replacing the VACUUM FULL implementation in 9.0.  But I suppose the OP
is running something not too up-to-date :-(.
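
(As a stopgap sketch only, until the server can be brought up to date: since
the trigger here is statement_timeout cancelling the VACUUM FULL after it has
internally committed, the session that issues the vacuum could disable the
timeout just for that command, along the lines of

	SET statement_timeout = 0;   -- session-local override, doesn't touch postgresql.conf
	VACUUM FULL ANALYZE _zamboni.sl_log_1;

but updating is the real fix.)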

> Wow. Four hours recovery with default checkpoint settings.

Ouch ... the reason for that needs investigation.

            regards, tom lane

Re: UTC4115FATAL: the database system is in recovery mode

From
Mathew Samuel
Date:
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> On 05/30/2011 10:29 PM, Mathew Samuel wrote:
>
>> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too
>> frequently (10 seconds apart)
>> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
>> timeout
>> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze
>> _zamboni.sl_log_1
>> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
>> it was already committed
>> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was
>> terminated by signal 6
>
> Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by statement_timeout, couldn't be aborted
> (assuming it was in fact 1827110275) and then the backend crashed with a signal 6 (SIGABRT).
> SIGABRT can be caused by an assertion failure, certain fatal aborts in the C library caused by memory allocation
> errors, etc.
>
> Alas, while PostgreSQL may have dumped a core file I doubt there's any debug information in your build. If you do
> find a core file for that process ID, it might be worth checking for a debuginfo rpm just in case.
>
>> In fact those last 3 lines are repeated over and over until
>> "UTC4115FATAL: the database system is in recovery mode" is logged for
>> about 4 hours. At some point, roughly 4 hours later, it appears that
>> the system recovers.
>
> Wow. Four hours recovery with default checkpoint settings.
>
> Is it possible that the server was completely overloaded and was swapping heavily? That could explain why VACUUM
> timed out in the first place, and would explain why it took such a long time to recover. Check your system logs around
> the same time for other indications of excessive load, and check your monitoring history if you have something like
> Cacti active.
>
> See if there's anything interesting in the kernel logs too.
>
> Just for completeness, can you send all non-commented-out, non-blank lines in your postgresql.conf ?
>
> $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1
>
> --
> Craig Ringer

Thanks for your help. Here are the results from running the command you provided:
$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

listen_addresses='localhost'
port=5432
max_connections=200
ssl=off
shared_buffers = 24MB
max_fsm_pages = 153600
log_line_prefix='%t%p'
statement_timeout=60000
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

And as you anticipated, no core file was dumped on that system.

It is quite possible that the system was under heavy load at the time, as it is a heavily used system that would be
prone to periods of intense use. No monitoring (Cacti or otherwise) was active as far as I can gather.

Not that this is any sort of solution, but would one approach be to increase statement_timeout to something greater than 1
minute? I realize that this may just hide the symptom, but if the customer does hit this issue again I'd like to know whether
there is an option I can change to help prevent a recurrence.
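
Concretely I imagine that would just be a change to the existing line in postgresql.conf, something like the following (the value is only a placeholder, I haven't settled on anything):

statement_timeout = 300000              # in ms; 5 minutes, up from the current 60000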

Cheers,
Matt

Re: UTC4115FATAL: the database system is in recovery mode

From
Craig Ringer
Date:
On 05/31/2011 10:46 PM, Mathew Samuel wrote:
> Thanks for your help, here are the results from running that you provided to me:
> $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

[snip]

OK, nothing interesting there. Thanks for posting it, though; I was
wondering if there might be a very high work_mem causing a swap storm or
something like that at the root of it.

> And as you anticipated, no core file was dumped on that system.

Given that it sounds like Tom knows what caused it, it turns out not to
matter much. Thanks for checking, though.

> It is quite possible that the system was under heavy load at the time, as it is a heavily used system that would be
> prone to periods of intense use. No monitoring (Cacti or otherwise) was active as far as I can gather.

Getting monitoring in place would be good, as it'll help you plan
capacity and identify overload.

> Not that this is any sort of solution, but would one approach be to increase statement_timeout to something greater than 1
> minute? I realize that this may just hide the symptom, but if the customer does hit this issue again I'd like to know whether
> there is an option I can change to help prevent a recurrence.

Update PostgreSQL. You're running an ancient point release: the current 8.2
release is 8.2.21, so you are missing a *LOT* of bug fixes and small
performance improvements.

--
Craig Ringer