Thread: UTC4115FATAL: the database system is in recovery mode
Hi,
I see the following error as found in pg.log:
UTC4115FATAL: the database system is in recovery mode
Actually that message was logged repeatedly for about 4 hours according to the logs (I don't have access to the system itself, just the logs).
Leading up to that error were the following in pg.log:
2011-03-28 10:44:06 UTC3609LOG: checkpoints are occurring too frequently (11 seconds apart)
2011-03-28 10:44:06 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG: checkpoints are occurring too frequently (12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement timeout
2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, it was already committed
2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated by signal 6
2011-03-28 10:44:42 UTC3566LOG: terminating any other active server processes
2011-03-28 10:44:42 UTC13142WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC13142DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC29834DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC3553DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:06 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG: checkpoints are occurring too frequently (12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement timeout
2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, it was already committed
2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated by signal 6
2011-03-28 10:44:42 UTC3566LOG: terminating any other active server processes
2011-03-28 10:44:42 UTC13142WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC13142DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC29834DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC3553DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
In fact those last 3 lines are repeated over and over again repeatedly until "UTC4115FATAL: the database system is in recovery mode" is logged for 4 hours. At some point, 4 hours later of course, it appears that the system recovers.
The Checkpoints Settings in postgresql.conf are commented out so I guess the defaults are being used:
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_warning = 30s # 0 is off
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_warning = 30s # 0 is off
That system where this was seen was using pgsql-8.2.6 on RHEL4.
Not sure if this is a known bug (or if it is a bug at all or something I can address using different configuration) but I thought I would post here first if any one might be familiar with this issue and suggest a possible solution. Any ideas?
Cheers,
Matt
On 05/30/2011 10:29 PM, Mathew Samuel wrote: > 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration > parameter "checkpoint_segments". > 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently > (10 seconds apart) > 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration > parameter "checkpoint_segments". > 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement > timeout > 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1 > 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, > it was already committed > 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated > by signal 6 Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by statement_timeout, couldn't be aborted (assuming it was in fact 1827110275) and then the backend crashed with a signal 6 (SIGABRT). SIGABRT can be caused by an assertion failure, certain fatal aborts in the C library caused by memory allocation errors, etc. Alas, while PostgreSQL may have dumped a core file I doubt there's any debug information in your build. If you do find a core file for that process ID, it might be worth checking for a debuginfo rpm just in case. > In fact those last 3 lines are repeated over and over again repeatedly > until "UTC4115FATAL: the database system is in recovery mode" is logged > for 4 hours. At some point, 4 hours later of course, it appears that the > system recovers. Wow. Four hours recovery with default checkpoint settings. Is it possible that the server was completely overloaded and was swapping heavily? That could explain why VACUUM timed out in the first place, and would explain why it took such a long time to recover. Check your system logs around the same time for other indications of excessive load, and check your monitoring history if you have monitoring like Cacti or the like active. See if there's anything interesting in the kernel logs too. Just for completeness, can you send all non-commented-out, non-blank lines in your postgresql.conf ? $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1 -- Craig Ringer
Craig Ringer <craig@postnewspapers.com.au> writes: > On 05/30/2011 10:29 PM, Mathew Samuel wrote: >> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement >> timeout >> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1 >> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, >> it was already committed >> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated >> by signal 6 > Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by > statement_timeout, couldn't be aborted (assuming it was in fact > 1827110275) and then the backend crashed with a signal 6 (SIGABRT). Yeah, that seems highly likely. There's a long-known problem in the pre-9.0 implementation of VACUUM FULL, that it marks itself as committed well before the vacuuming is actually done. Any error that occurs after that point results in exactly the above symptom. There's a hack solution for that in releases made after Nov 2009, and getting rid of the problem in a cleaner fashion was one of the motivations for replacing the VACUUM FULL implementation in 9.0. But I suppose the OP is running something not too up-to-date :-(. > Wow. Four hours recovery with default checkpoint settings. Ouch ... the reason for that needs investigation. regards, tom lane
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes: > On 05/30/2011 10:29 PM, Mathew Samuel wrote: > >> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration >> parameter "checkpoint_segments". >> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too >> frequently (10 seconds apart) >> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration >> parameter "checkpoint_segments". >> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement >> timeout >> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze >> _zamboni.sl_log_1 >> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, >> it was already committed >> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was >> terminated by signal 6 > > Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by statement_timeout, couldn't be aborted (assumingit was in fact > 1827110275) and then the backend crashed with a signal 6 (SIGABRT). > SIGABRT can be caused by an assertion failure, certain fatal aborts in the C library caused by memory allocation errors,etc. > > Alas, while PostgreSQL may have dumped a core file I doubt there's any debug information in your build. If you do finda core file for that process ID, it might be worth checking for a debuginfo rpm just in case. > >> In fact those last 3 lines are repeated over and over again repeatedly >> until "UTC4115FATAL: the database system is in recovery mode" is >> logged for 4 hours. At some point, 4 hours later of course, it appears >> that the system recovers. > > Wow. Four hours recovery with default checkpoint settings. > > Is it possible that the server was completely overloaded and was swapping heavily? That could explain why VACUUM timedout in the first place, and would explain why it took such a long time to recover. Check your system logs around thesame time for other indications of excessive load, and check your monitoring history if you have monitoring like Cactior the like active. > > See if there's anything interesting in the kernel logs too. > > Just for completeness, can you send all non-commented-out, non-blank lines in your postgresql.conf ? > > $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1 > > -- > Craig Ringer Thanks for your help, here are the results from running that you provided to me: $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1 listen_addresses='localhost' port=5432 max_connections=200 ssl=off shared_buffers = 24MB max_fsm_pages = 153600 log_line_prefix='%t%p' statement_timeout=60000 datestyle = 'iso, mdy' lc_messages = 'C' lc_monetary = 'C' lc_numeric = 'C' lc_time = 'C' And as you anticipated, no core file was dumped on that system. It is quite possible that the system was under heavy load at the time as it is a heavily used system that would be proneto periods of strong use. No monitoring (Cacti or otherwise) was active as far as I can gather. Not that this is any sort of solution but would one approach be to increase the statement_timeout to greater than 1 minute?I realize that this may hide the symptom but if the customer does hit this issue again I'm just seeing if there isan option I can change to help prevent a reoccurrence. Cheers, Matt
On 05/31/2011 10:46 PM, Mathew Samuel wrote: > Thanks for your help, here are the results from running that you provided to me: > $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1 [snip] OK, nothing interesting there. Thanks for posting it, though; I was wondering if there might be a very high work_mem causing a swap storm or something like that at the root of it. > And as you anticipated, no core file was dumped on that system. Given that it sounds like Tom knows what caused it, it turns out not to matter much. Thanks for checking, though. > It is quite possible that the system was under heavy load at the time as it is a heavily used system that would be proneto periods of strong use. No monitoring (Cacti or otherwise) was active as far as I can gather. Getting monitoring in place would be good, as it'll help you plan capacity and identify overload. > Not that this is any sort of solution but would one approach be to increase the statement_timeout to greater than 1 minute?I realize that this may hide the symptom but if the customer does hit this issue again I'm just seeing if there isan option I can change to help prevent a reoccurrence. Update. You're running an ancient point release. The current 8.2 release is 8.2.21, so you are missing a *LOT* of bug fixes and small performance fixes. -- Craig Ringer