Thread: The database system is in recovery mode
Our database just experienced the problem in the subject line. After the error, the database was still up, but would issue the error to any new connections. The stats collector process, a vacuum and one other connection were all in an uninterruptible state and the machine had to be rebooted.

Could this be the Linux kernel randomly killing processes under heavy load issue? I've seen that happen on other machines before, but in those cases the kernel logged when it was killing processes in syslog... There were no messages in syslog in this case.

System is PostgreSQL 7.2.1 on Red Hat 7.2. Here are the logs:

2003-05-01 16:54:08 DEBUG: server process (pid 2599) was terminated by signal 11
2003-05-01 16:54:08 DEBUG: terminating any other active server processes
2003-05-01 16:54:08 NOTICE: Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.

After a bunch of these, the database goes into recovery mode:

2003-05-01 16:54:08 FATAL 1: The database system is in recovery mode

Then, after the machine is rebooted and while it is starting up, there are these messages:

2003-05-01 17:35:49 DEBUG: ReadRecord: unexpected pageaddr 21/37D94000 in log file 33, segment 63, offset 14237696
2003-05-01 17:35:49 DEBUG: redo done at 21/3FD92564

I presume this is rerunning the WAL? Is the message serious... could there be database corruption or just lost transactions?

Thanks for any help.

Regards,

Trevor Astrope
astrope@e-corp.net
Double-check your hardware: replace the RAM and perhaps even the hard disk. The only time I have experienced such fatal errors, it was a hardware fault. Hurry, before your data gets really corrupted...

Regards,
Bjoern

On Friday, May 02, 2003 12:24 AM [GMT+1=CET], Trevor Astrope <astrope@e-corp.net> wrote:

> Our database just experienced the problem in the subject line. After the
> error, the database was still up, but would issue the error to any new
> connections. The stats collector process, a vacuum and one other
> connection were all in an uninterruptable state and the machine had to be
> rebooted.
>
> Could this be the linux kernel randomly killing processes under heavy
> load issue? I've seen that happen on other machines before, but in those
> cases the kernel logged when it was killing processes in syslog... There
> were no messages in syslog in this case.
>
> System is postgresql 7.2.1 on redhat 7.2. Here's the logs:
>
> 2003-05-01 16:54:08 DEBUG: server process (pid 2599) was terminated by signal 11
> 2003-05-01 16:54:08 DEBUG: terminating any other active server processes
> 2003-05-01 16:54:08 NOTICE: Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
>
> After a bunch of these, the database goes in recovery mode:
>
> 2003-05-01 16:54:08 FATAL 1: The database system is in recovery mode
>
> Then after the machine is rebooted and while it is starting up, there is
> these messages:
>
> 2003-05-01 17:35:49 DEBUG: ReadRecord: unexpected pageaddr 21/37D94000 in log file 33, segment 63, offset 14237696
> 2003-05-01 17:35:49 DEBUG: redo done at 21/3FD92564
>
> I presume this is rerunning the WAL? Is the message serious...could there
> be database corruption or just lost transactions?
On Thu, May 01, 2003 at 06:24:03PM -0400, Trevor Astrope wrote:

> Could this be the linux kernel randomly killing processes under heavy
> load issue?

Not from the look of things. See below.

> System is postgresql 7.2.1 on redhat 7.2. Here's the logs:

You should really upgrade at least to 7.2.4 (no dump required). 7.2.1 has some nasty bugs.

> 2003-05-01 16:54:08 DEBUG: server process (pid 2599) was
> terminated by signal 11
                       ^^

That's not signal 9, so it's not the kernel killing processes. Sig 11 is SIGSEGV on Linux, which probably means some sort of memory problem. Are you using ECC RAM for your database? You should. In any case, the first thing I'd do is run memtest86 on it.

> 2003-05-01 16:54:08 DEBUG: terminating any other active server processes
> 2003-05-01 16:54:08 NOTICE: Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
>
> After a bunch of these, the database goes in recovery mode:

That's what it's supposed to do. It's what WAL buys you.

> I presume this is rerunning the WAL? Is the message serious...could there
> be database corruption or just lost transactions?

Neither, assuming you have good hardware and you're using fsync. WAL is there precisely to make the system crash safe. (Of course, if it's sitting on an ext2 partition and the system goes down hard, you have a different batch of problems. But WAL+fsync protects you from postmaster crashes, and from machine crashes if your filesystem is crash-safe.)

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
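For readers who want to see what "terminated by signal 11" looks like from a parent process, here is a minimal Python sketch. It is not PostgreSQL code; the helper names `describe_exit` and `run_child_that_raises` are made up for illustration. It starts a child that sends itself a signal and then reports how the child ended, roughly the way the postmaster reports a dead backend. It assumes a POSIX system; the numbers 9 and 11 are the Linux values for SIGKILL and SIGSEGV.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable description.

    On POSIX, subprocess reports death-by-signal as a negative return code.
    """
    if returncode < 0:
        signum = -returncode
        return f"terminated by signal {signum} ({signal.Signals(signum).name})"
    return f"exited with status {returncode}"


def run_child_that_raises(signum: int) -> int:
    """Hypothetical helper: run a child Python process that signals itself."""
    code = f"import os; os.kill(os.getpid(), {int(signum)})"
    return subprocess.run([sys.executable, "-c", code]).returncode


if __name__ == "__main__":
    # A segmentation fault (what the log in this thread shows).
    print(describe_exit(run_child_that_raises(signal.SIGSEGV)))
    # A forced kill, which is what the kernel uses when it kills a process
    # for memory pressure.
    print(describe_exit(run_child_that_raises(signal.SIGKILL)))
```

The point of the comparison is only that the two cases are distinguishable from the signal number in the log line; it says nothing about why the segmentation fault happened.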
Trevor Astrope <astrope@e-corp.net> writes:

> Could this be the linux kernel randomly killing processes under heavy
> load issue?

I concur with the other respondent who pointed out that the kernel uses signal 9, not 11, when it wants to kill something. A check for marginal hardware seems in order.

> Then after the machine is rebooted and while it is starting up, there is
> these messages:
> 2003-05-01 17:35:49 DEBUG: ReadRecord: unexpected pageaddr 21/37D94000 in log file 33, segment 63, offset 14237696
> 2003-05-01 17:35:49 DEBUG: redo done at 21/3FD92564
> I presume this is rerunning the WAL? Is the message serious...could there
> be database corruption or just lost transactions?

That message is expected if the old WAL happened to end exactly on a page boundary --- which is somewhat unlikely, but certainly not implausible. I don't think you lost anything.

regards, tom lane
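The numbers in the two startup messages can be checked by hand, and doing so is consistent with the page-boundary explanation. The Python sketch below is only arithmetic on the values in the log; it assumes 16 MB WAL segments and 8 kB WAL pages (assumptions here, not something stated in the thread), and the function names are invented for illustration.

```python
# Assumed sizes (not stated in the thread): 16 MB WAL segments, 8 kB WAL pages.
SEGMENT_SIZE = 16 * 1024 * 1024
PAGE_SIZE = 8192


def byte_position(segment: int, offset: int) -> int:
    """Byte position within a log file, from segment number and offset."""
    return segment * SEGMENT_SIZE + offset


def format_location(log_file: int, position: int) -> str:
    """Render a location the way the log messages do: hex file / hex position."""
    return f"{log_file:X}/{position:X}"


def next_page_start(position: int) -> int:
    """Start of the page following the one that contains `position`."""
    return (position // PAGE_SIZE + 1) * PAGE_SIZE


# Values taken from the log lines quoted above.
page_read = byte_position(segment=63, offset=14237696)
where_page_should_be = format_location(33, page_read)   # location being read
redo_done_at = 0x3FD92564                                # "redo done at 21/3FD92564"
found_pageaddr = 0x37D94000                              # "unexpected pageaddr 21/37D94000"

# Segment that the unexpected page address would belong to.
found_segment = found_pageaddr // SEGMENT_SIZE

if __name__ == "__main__":
    print("page being read:        ", where_page_should_be)
    print("page after last record: ", format_location(33, next_page_start(redo_done_at)))
    print("segment in found header:", found_segment)
```

Under those assumptions, the page being read starts exactly at the page boundary following the last replayed record, and the address found in its header points into a lower-numbered segment at the same offset. One plausible reading, offered as a guess rather than something confirmed in the thread, is that the page simply held leftover contents that had not yet been overwritten, which is what recovery would see at the true end of the log.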
In article <20030502141444.GC13419@libertyrms.info>, Andrew Sullivan <andrew@libertyrms.info> wrote:

>Neither, assuming you have good hardware and you're using fsync. WAL
>is there precisely to make the system crash safe. (Of course, if
>it's sitting on an ext2 partition and the system goes down hard, you
>have a different batch of problems. But WAL+fsync protects you from
>postmaster crashes, and machine crashes if your filesystem is
>crash-safe.)

You seem to be implying that ext2+fsync is not machine crash safe.

Is this really what you are trying to say?

If so, could you point to docs that verify that?

I could definitely see where ext2 without fsync would leave the system in a strange state, but with fsync it should be fine.

mrc

--
Mike Castle    dalgoda@ix.netcom.com    www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc
On Wed, May 07, 2003 at 06:26:50PM -0700, Mike Castle wrote:

>
> If so, could you point to docs that verify that?

Just the experience of people who have used ext2 and have had failures after a crash. I don't pretend to understand the issues in the filesystems, but there are reports of unrecoverable ext2 errors after a crash. What I have read about ext2 is that it is not entirely crash safe, even with fsync. But I don't know enough about filesystems to say for sure.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
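As a footnote to the fsync discussion, this is roughly what "using fsync" means at the system-call level: a record is not treated as durable until the operating system has been asked to push it to stable storage. The Python sketch below is a generic illustration, not PostgreSQL's WAL code, and `durable_append` is a made-up name. Whether the data actually survives a hard machine crash still depends on the filesystem and the hardware underneath, which is exactly the open question in this thread.

```python
import os


def durable_append(path: str, record: bytes) -> None:
    """Append a record to a log file and fsync before returning.

    Illustrative only: a caller would treat the record as committed
    once this function returns without raising.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # os.write may write fewer bytes than requested, so loop until done.
        remaining = memoryview(record)
        while remaining:
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        # Ask the OS to flush this file's data to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    import tempfile

    log_path = os.path.join(tempfile.mkdtemp(), "example.log")
    durable_append(log_path, b"begin\n")
    durable_append(log_path, b"commit\n")
    with open(log_path, "rb") as f:
        print(f.read())
```

Note that this only covers the file's contents. A newly created file's directory entry is a separate matter, and the sketch does not attempt to handle it.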