Thread: The database system is in recovery mode

The database system is in recovery mode

From
Trevor Astrope
Date:
Our database just experienced the problem in the subject line. After the
error, the database was still up, but would issue the error to any new
connections. The stats collector process, a vacuum and one other
connection were all in an uninterruptible state and the machine had to be
rebooted.

Could this be the issue where the Linux kernel randomly kills processes
under heavy load?  I've seen that happen on other machines before, but in
those cases the kernel logged in syslog when it was killing processes.
There were no such messages in syslog in this case.

System is PostgreSQL 7.2.1 on Red Hat 7.2. Here are the logs:

2003-05-01 16:54:08 DEBUG:  server process (pid 2599) was terminated by signal 11
2003-05-01 16:54:08 DEBUG:  terminating any other active server processes
2003-05-01 16:54:08 NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.

After a bunch of these, the database goes into recovery mode:

2003-05-01 16:54:08 FATAL 1:  The database system is in recovery mode


Then, after the machine is rebooted and while it is starting up, there are
these messages:

2003-05-01 17:35:49 DEBUG:  ReadRecord: unexpected pageaddr 21/37D94000 in log file 33, segment 63, offset 14237696
2003-05-01 17:35:49 DEBUG:  redo done at 21/3FD92564

I presume this is rerunning the WAL? Is the message serious...could there
be database corruption or just lost transactions?


Thanks for any help.


Regards,

Trevor Astrope
astrope@e-corp.net


Re: The database system is in recovery mode

From
Björn Metzdorf
Date:
Double-check your hardware; replace the RAM and perhaps even the hard disk.

The only time I have experienced such fatal errors, it was a hardware fault.
Hurry, before your data gets really corrupted...

Regards,
Bjoern


On Friday, May 02, 2003 12:24 AM [GMT+1=CET],
Trevor Astrope <astrope@e-corp.net> wrote:

> Our database just experienced the problem in the subject line. After the
> error, the database was still up, but would issue the error to any new
> connections. The stats collector process, a vacuum and one other
> connection were all in an uninterruptable state and the machine had to be
> rebooted.
>
> Could this be the linux kernel randomly killing processes under heavy
> load issue?  I've seen that happen on other machines before, but in those
> cases the kernel logged when it was killing processes in syslog... There
> were no messages in syslog in this case.
>
> System is postgresql 7.2.1 on redhat 7.2. Here's the logs:
>
> 2003-05-01 16:54:08 DEBUG:  server process (pid 2599) was terminated by signal 11
> 2003-05-01 16:54:08 DEBUG:  terminating any other active server processes
> 2003-05-01 16:54:08 NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
>
> After a bunch of these, the database goes in recovery mode:
>
> 2003-05-01 16:54:08 FATAL 1:  The database system is in recovery mode
>
> Then after the machine is rebooted and while it is starting up, there is
> these messages:
>
> 2003-05-01 17:35:49 DEBUG:  ReadRecord: unexpected pageaddr 21/37D94000 in log file 33, segment 63, offset 14237696
> 2003-05-01 17:35:49 DEBUG:  redo done at 21/3FD92564
>
> I presume this is rerunning the WAL? Is the message serious...could there
> be database corruption or just lost transactions?
>
> Thanks for any help.
>
> Regards,
>
> Trevor Astrope
> astrope@e-corp.net


Re: The database system is in recovery mode

From
Andrew Sullivan
Date:
On Thu, May 01, 2003 at 06:24:03PM -0400, Trevor Astrope wrote:
>  Could this be the linux kernel randomly killing processes under heavy
> load issue?

Not from the look of things.  See below.

> System is postgresql 7.2.1 on redhat 7.2. Here's the logs:

You should really upgrade at least to 7.2.4 (no dump required).
7.2.1 has some nasty bugs.

> 2003-05-01 16:54:08 DEBUG:  server process (pid 2599) was
> terminated by signal 11
                       ^^

That's not signal 9, so it's not the kernel.  Sig 11 is SIGSEGV on
Linux, which probably means some sort of memory problem.  Are you
using ECC RAM for your database?  You should.  In any case, the first
thing I'd do is run memtest86 on it.
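
As a minimal sketch of the signal numbers discussed above: the Python
snippet below maps 9 and 11 to their symbolic names on Linux and shows how
a parent process sees a child that died from signal 11. Python is used only
for illustration, the helper names `describe_signal` and
`exit_signal_of_crashing_child` are hypothetical, and the crash is simulated
by the child sending SIGSEGV to itself rather than by a real memory fault.

```python
import signal
import subprocess
import sys


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"unknown signal {signum}"


def exit_signal_of_crashing_child() -> int:
    """Run a child that sends itself SIGSEGV; return the signal that ended it."""
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    result = subprocess.run([sys.executable, "-c", child_code])
    # On POSIX, a return code of -N means "terminated by signal N".
    return -result.returncode


if __name__ == "__main__":
    for signum in (9, 11):
        print(signum, describe_signal(signum))
    print("child terminated by signal", exit_signal_of_crashing_child())
```

The parent learns only the signal number from the child's exit status, which
is the same kind of information the postmaster reports in the
"was terminated by signal 11" log line.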


> 2003-05-01 16:54:08 DEBUG:  terminating any other active server processes
> 2003-05-01 16:54:08 NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend
>         died abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am
>         going to terminate your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
>
> After a bunch of these, the database goes in recovery mode:

That's what it's supposed to do.  It's what WAL buys you.

> I presume this is rerunning the WAL? Is the message serious...could there
> be database corruption or just lost transactions?

Neither, assuming you have good hardware and you're using fsync.  WAL
is there precisely to make the system crash safe.  (Of course, if
it's sitting on an ext2 partition and the system goes down hard, you
have a different batch of problems.  But WAL+fsync protects you from
postmaster crashes, and machine crashes if your filesystem is
crash-safe.)
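
To illustrate the general "write, then fsync before acknowledging" idea
behind the paragraph above, here is a minimal Python sketch. It is not
PostgreSQL's implementation; the function name `append_durably` and the
one-file log layout are assumptions made for the example. It also only asks
the operating system to flush the data, so the guarantee still depends on
the filesystem and hardware honouring that request.

```python
import os


def append_durably(path: str, record: bytes) -> None:
    """Append a record to a log file and force it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(record)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to flush the file's data to the storage device before
        # the caller treats the record as committed.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    append_durably("example.log", b"commit 1\n")
```

Only after `append_durably` returns would a caller report the change as
committed; a crash before that point loses at most an unacknowledged record.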

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: The database system is in recovery mode

From
Tom Lane
Date:
Trevor Astrope <astrope@e-corp.net> writes:
>  Could this be the linux kernel randomly killing processes under heavy
> load issue?

I concur with the other respondent who pointed out that the kernel uses
signal 9, not 11, when it wants to kill something.  A check for marginal
hardware seems in order.

> Then after the machine is rebooted and while it is starting up, there is
> these messages:

> 2003-05-01 17:35:49 DEBUG:  ReadRecord: unexpected pageaddr 21/37D94000 in log file 33, segment 63, offset 14237696
> 2003-05-01 17:35:49 DEBUG:  redo done at 21/3FD92564

> I presume this is rerunning the WAL? Is the message serious...could there
> be database corruption or just lost transactions?

That message is expected if the old WAL happened to end exactly on a
page boundary --- which is somewhat unlikely, but certainly not
implausible.  I don't think you lost anything.

            regards, tom lane
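
The numbers in the two log lines can be checked with a little arithmetic.
The Python sketch below is only an illustration: it assumes the default
16 MB WAL segment size and 8 kB WAL page size, and the helper names
`parse_lsn`, `locate`, and `next_page_address` are hypothetical. It decodes
a hex `xlogid/xrecoff` address into log file, segment, and offset, and
computes the start of the page that follows a given address.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size
PAGE_SIZE = 8192                 # assumed default WAL page size


def parse_lsn(text: str) -> tuple[int, int]:
    """Split a hex 'xlogid/xrecoff' pair such as '21/3FD92564'."""
    log_id, rec_off = text.split("/")
    return int(log_id, 16), int(rec_off, 16)


def locate(text: str) -> dict[str, int]:
    """Translate an address into decimal log file, segment, and offset."""
    log_id, rec_off = parse_lsn(text)
    return {
        "log_file": log_id,
        "segment": rec_off // SEGMENT_SIZE,
        "offset": rec_off % SEGMENT_SIZE,
    }


def next_page_address(text: str) -> str:
    """Return the address of the first page after the one containing `text`."""
    log_id, rec_off = parse_lsn(text)
    next_page = (rec_off // PAGE_SIZE + 1) * PAGE_SIZE
    return f"{log_id:X}/{next_page:08X}"


if __name__ == "__main__":
    print(locate("21/37D94000"))             # the page address that was found
    print(next_page_address("21/3FD92564"))  # page after the last redone record
    print(locate(next_page_address("21/3FD92564")))
```

Under those assumptions, the page after the last redone record starts at
21/3FD94000, which is log file 33, segment 63, offset 14237696: exactly the
position named in the ReadRecord message. The address actually found there,
21/37D94000, decodes to segment 55 at the same offset. That is consistent
with recovery reaching the end of valid WAL at a page boundary and finding
older data in the next page, as described above.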


Re: The database system is in recovery mode

From
dalgoda@ix.netcom.com (Mike Castle)
Date:
In article <20030502141444.GC13419@libertyrms.info>,
Andrew Sullivan  <andrew@libertyrms.info> wrote:
>Neither, assuming you have good hardware and you're using fsync.  WAL
>is there precisely to make the system crash safe.  (Of course, if
>it's sitting on an ext2 partition and the system goes down hard, you
>have a different batch of problems.  But WAL+fsync protects you from
>postmaster crashes, and machine crashes if your filesystem is
>crash-safe.)


You seem to be implying that ext2+fsync is not machine crash safe.  Is this
really what you are trying to say?

If so, could you point to docs that verify that?

I could definitely see where ext2 without fsync would leave the system in
a strange state, but with fsync it should be fine.

mrc

--
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc


Re: The database system is in recovery mode

From
Andrew Sullivan
Date:
On Wed, May 07, 2003 at 06:26:50PM -0700, Mike Castle wrote:
>
> If so, could you point to docs that verify that?

Just the experience of people who have used ext2 and have had
failures after a crash.  I don't pretend to understand the issues in
the filesystems, but there are reports of unrecoverable ext2 errors
after a crash.  What I have read about ext2 is that it is not
entirely crash safe, even with fsync.  But I don't know enough about
filesystems to say for sure.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110