Thread: Problems restarting after database crashed (signal 11).

Problems restarting after database crashed (signal 11).

From
Christopher Cashell
Date:
Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible.  After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG:  checkpoint process (PID 28423) was
terminated by signal 11
2004-06-30 08:30:19 [24119] LOG:  terminating any other active server
processes
2004-06-30 08:30:19 [28383] WARNING:  terminating connection because of
crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the curre nt transaction and exit, because another server process exited
abnormally and po ssibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat yo ur command.
2004-06-30 08:30:19 [28362] WARNING:  terminating connection because of
crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the curre nt transaction and exit, because another server process exited
abnormally and po ssibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat yo ur command.

The last bit then repeated a few more times, and then:

2004-06-30 08:30:20 [24119] LOG:  all server processes terminated;
reinitializing
2004-06-30 08:30:20 [28424] LOG:  database system was interrupted at 2004-06-30
08:22:23 CDT
2004-06-30 08:30:20 [28424] LOG:  checkpoint record is at 8/77703F9C
2004-06-30 08:30:20 [28424] LOG:  redo record is at 8/775B1D38; undo
record is at 0/0; shutdown FALSE
2004-06-30 08:30:20 [28424] LOG:  next transaction ID: 1638554; next
OID: 1058492
2004-06-30 08:30:20 [28424] LOG:  database system was not properly shut
down; automatic recovery in progress
2004-06-30 08:30:20 [28424] LOG:  redo starts at 8/775B1D38
2004-06-30 08:30:21 [28430] LOG:  connection received: host=[local] port=
2004-06-30 08:30:21 [28430] FATAL:  the database system is starting up
2004-06-30 08:30:38 [28424] LOG:  record with zero length at 8/78855F38
2004-06-30 08:30:38 [28424] LOG:  redo done at 8/78853EE0
2004-06-30 08:31:40 [28449] LOG:  connection received: host=[local] port=
2004-06-30 08:31:40 [28449] FATAL:  the database system is starting up
2004-06-30 08:31:48 [28452] LOG:  connection received: host=[local] port=
2004-06-30 08:31:48 [28452] FATAL:  the database system is starting up
2004-06-30 08:31:53 [28459] LOG:  connection received: host=[local] port=
2004-06-30 08:31:53 [28459] FATAL:  the database system is starting up

And this then continues on and on.  Even 20 minutes later, attempts to
connect to the database were met with the same FATAL error.

Eventually I attempted to shut it down and restart it, however that
failed too.  When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424  0.0  1.5 16804 3044 pts/313  D    08:35   0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424  0.0  1.5 16804 3044 pts/313  D    08:35   0:06 postgres:
startup subprocess
nexus:~#

As soon as I can get physical access to the machine, I'm planning to
reboot it, as I can't think of anything else to do to kill a process
that can't be kill -KILL'ed.

I'm worried that attempting to start the database after rebooting will
fail in the same way, however.  Has anyone seen anything like this
before, or have any ideas on how to proceed?

I'm running on an Intel Pentium Pro box, with Debian/GNU Linux, running
'unstable'.  I'm using PostgreSQL 7.4.3.

Thank you for your help.

--
| Christopher
+------------------------------------------------+
| Here I stand.  I can do no other.              |
+------------------------------------------------+


Re: Problems restarting after database crashed (signal 11).

From
Tom Lane
Date:
Christopher Cashell <topher-pgsql@zyp.org> writes:
> Eventually I attempted to shut it down and restart it, however that
> failed too.  When I attempted to shut it down, I discovered a hung
> 'startup subprocess' that can't be killed.

This is interesting because it seems just about exactly like this
recent Red Hat bug report:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126885

As I commented there, I think that it must be a kernel or hardware
issue --- Postgres itself can surely not make an unkillable process.
However it's common to see processes that don't respond to kill if
they are stuck inside a kernel I/O request.  That could mean either
unresponsive hardware or a kernel bug.

I wonder whether you have any similarities in hardware or Linux kernel
to the person who filed the above report?

            regards, tom lane

Re: Problems restarting after database crashed (signal 11).

From
Christopher Cashell
Date:
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:
> Christopher Cashell <topher-pgsql@zyp.org> writes:
> > Eventually I attempted to shut it down and restart it, however that
> > failed too.  When I attempted to shut it down, I discovered a hung
> > 'startup subprocess' that can't be killed.
>
> This is interesting because it seems just about exactly like this
> recent Red Hat bug report:
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126885

Hrm.  Yes, it does appear to be a very similar, if not identical, issue.

> As I commented there, I think that it must be a kernel or hardware
> issue --- Postgres itself can surely not make an unkillable process.
> However it's common to see processes that don't respond to kill if
> they are stuck inside a kernel I/O request.  That could mean either
> unresponsive hardware or a kernel bug.

That is somewhat along the lines of what I was thinking, although I have
had no problems like this before.  The machine has been running for over
100 days, and the database as well, without issue.

28424 postgres  18   0 16804 3044  15m D  0.0  1.6   0:06.72 postmaster

Note that it does have a process status of 'D', or uninterruptible
sleep.  That would explain the unkillable part, though I'm curious how
it ended up there.  Unless it just happened to be in a really bad spot
when Posgres segfaulted. . . although, I wouldn't expect that would
affect the 'startup subprocess'.

> I wonder whether you have any similarities in hardware or Linux kernel
> to the person who filed the above report?

Here's all the information I can provide for this machine:

 IBM IntelliStation Z Pro
 Model: 6899-12U
 Dual Pentium Pro 200
 192MB RAM
 4.5 GB IBM SCSI HDD
 9 GB IBM SCSI HDD
 6.4 GB WD HDD

 The database resides on the 4.5 GB SCSI, with the pg_xlog directory
 symlinked from there, and actually existing on the 9GB SCSI.

nexus:~$ uname -a
Linux nexus.zyp.org 2.6.4 #1 SMP Thu Mar 11 14:04:49 CST 2004 i686 GNU/Linux
nexus:~$ uptime
21:15:39 up 107 days, 20:57,  7 users,  load average: 2.04, 2.31, 2.38

If there's any other information I can provide, please let me know.

I'm going to reboot the box right now, and cross my fingers, hoping
it'll come back up. ;-)

>             regards, tom lane

--
| Christopher
+------------------------------------------------+
| Here I stand.  I can do no other.              |
+------------------------------------------------+


Re: Problems restarting after database crashed (signal 11).

From
Tom Lane
Date:
Christopher Cashell <topher-pgsql@zyp.org> writes:
> 28424 postgres  18   0 16804 3044  15m D  0.0  1.6   0:06.72 postmaster

> Note that it does have a process status of 'D', or uninterruptible
> sleep.  That would explain the unkillable part, though I'm curious how
> it ended up there.

Perhaps I'm just of an older generation, but I always thought 'D' meant
"disk I/O wait".  Which definitely is uninterruptible on most Unixen.

            regards, tom lane

Re: Problems restarting after database crashed (signal

From
"Scott Marlowe"
Date:
On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
> Yesterday, while attempting to access a database, I received errors
> saying that the database was innaccessible.  After investigating a
> little, I found the following in the PostgreSQL log files:
>
> 2004-06-30 08:30:19 [24119] LOG:  checkpoint process (PID 28423) was
> terminated by signal 11

> Eventually I attempted to shut it down and restart it, however that
> failed too.  When I attempted to shut it down, I discovered a hung
> 'startup subprocess' that can't be killed.
>
> nexus:~# ps aux | grep postgres
> postgres 28424  0.0  1.5 16804 3044 pts/313  D    08:35   0:06 postgres:
> startup subprocess
> nexus:~# kill -9 28424
> nexus:~# ps aux | grep postgres
> postgres 28424  0.0  1.5 16804 3044 pts/313  D    08:35   0:06 postgres:
> startup subprocess
> nexus:~#

The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory).  Have
you tested this machine?


Re: Problems restarting after database crashed (signal

From
"Scott Marlowe"
Date:
On Wed, 2004-06-30 at 21:41, Scott Marlowe wrote:
> On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
> > Yesterday, while attempting to access a database, I received errors
> > saying that the database was innaccessible.  After investigating a
> > little, I found the following in the PostgreSQL log files:
> >
> > 2004-06-30 08:30:19 [24119] LOG:  checkpoint process (PID 28423) was
> > terminated by signal 11
>
> > Eventually I attempted to shut it down and restart it, however that
> > failed too.  When I attempted to shut it down, I discovered a hung
> > 'startup subprocess' that can't be killed.
> >
> > nexus:~# ps aux | grep postgres
> > postgres 28424  0.0  1.5 16804 3044 pts/313  D    08:35   0:06 postgres:
> > startup subprocess
> > nexus:~# kill -9 28424
> > nexus:~# ps aux | grep postgres
> > postgres 28424  0.0  1.5 16804 3044 pts/313  D    08:35   0:06 postgres:
> > startup subprocess
> > nexus:~#
>
> The combination of a Sig 11 failure and a process stuck in a D state
> makes me lean towards thinking it's bad hardware (CPU or memory).  Have
> you tested this machine?

Oh, and a possibly buggy kernel or kernel module somewhere as well.
Didn't mean to not say it, and have had problems with some kernels under
heavy parallel loads doing stupid things that look just like this.


Re: Problems restarting after database crashed (signal

From
Christopher Cashell
Date:
At Wed, 30 Jun 04, Unidentified Flying Banana Scott Marlowe, said:
> The combination of a Sig 11 failure and a process stuck in a D state
> makes me lean towards thinking it's bad hardware (CPU or memory).  Have
> you tested this machine?

It's possible that it's bad hardware, as the machine is a little long in
the tooth.  However, at the same time, it is a *very* well tested box.
It's been in production use for 4 years, and I have yet to experience a
significant hardware issue.

Up until this, prior to when I rebooted it a few hours ago over this
issue, it had been running for 110 days, with the database running
(under approximately the same load) for that whole time.

If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.

--
| Christopher
+------------------------------------------------+
| Here I stand.  I can do no other.              |
+------------------------------------------------+


Re: Problems restarting after database crashed (signal 11).

From
Christopher Cashell
Date:
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:
> Christopher Cashell <topher-pgsql@zyp.org> writes:
> > 28424 postgres  18   0 16804 3044  15m D  0.0  1.6   0:06.72 postmaster
>
> > Note that it does have a process status of 'D', or uninterruptible
> > sleep.  That would explain the unkillable part, though I'm curious how
> > it ended up there.
>
> Perhaps I'm just of an older generation, but I always thought 'D' meant
> "disk I/O wait".  Which definitely is uninterruptible on most Unixen.

I've always heard it as 'uninteruptable sleep', which I understood to be
the generic term description, and that 'disk I/O wait' is the most
common cause for it (but not the only possible cause).

Either way, I think you're right, and I think that is what happened
here.

>             regards, tom lane

--
| Christopher
+------------------------------------------------+
| Here I stand.  I can do no other.              |
+------------------------------------------------+


Re: Problems restarting after database crashed (signal

From
Doug McNaught
Date:
Christopher Cashell <topher-pgsql@zyp.org> writes:

> If it *is* a hardware problem, I would have expected it to show up
> somewhat sooner than this.

Parts do go bad.  Sig11 is often an indication of ether bad RAM

Re: Problems restarting after database crashed (signal

From
Doug McNaught
Date:
Christopher Cashell <topher-pgsql@zyp.org> writes:

> If it *is* a hardware problem, I would have expected it to show up
> somewhat sooner than this.

[sorry, hit Send a bit too early]

Parts do go bad.  Sig11 is often an indication of either bad RAM or an
overheating processor.  Have you checked your fans, especially since
it's an old machine?

-Doug

Re: Problems restarting after database crashed (signal 11).

From
Tom Lane
Date:
Christopher Cashell <topher-pgsql@zyp.org> writes:
> Either way, I think you're right, and I think that is what happened
> here.

So did it come up after you rebooted?  The other guy wasn't having
any luck that way :-(

            regards, tom lane

Re: Problems restarting after database crashed (signal 11).

From
Christopher Cashell
Date:
At Thu, 01 Jul 04, Unidentified Flying Banana Tom Lane, said:
> Christopher Cashell <topher-pgsql@zyp.org> writes:
> > Either way, I think you're right, and I think that is what happened
> > here.
>
> So did it come up after you rebooted?  The other guy wasn't having
> any luck that way :-(

Yep. ;-)

In a testament to PostgreSQL's robustness, after rebooting the machine,
things went mostly[1] well.  Postgres started up without issue, programs
made their connections to the database, and queries were happily made.

Everything looks to be working perfectly now.

And luckily, this machine, though it holds critical data, isn't a
time/immediate access critical machine.  So having it down for a few
hours wasn't any kind of a problem, other than for my blood pressure.

Thank you Tom, and the rest of you, for your help.  It's greatly
appreciated.

>             regards, tom lane


  [1] Upon rebooting, the autovacuum utility kind of harfed.  Here is
      the log entries that it made:

      ---
      [2004-06-30 08:44:53 AM] Failed connection to database template1
      with error: FATAL:  the database system is starting up
      .
      [2004-06-30 08:44:53 AM] Failed connection to database template1
      with error: FATAL:  the database system is starting up
      .
      [2004-06-30 08:44:53 AM] Error: Cannot connect to template1,
      exiting.
      ---

      When I first saw this, the problem seemed to still be there, and I
      saw "FATAL: the database system is starting up" running across my
      brain in bright red letters.

      I took a minute to breath, and luckily realized that at least one
      program (that starts automatically upon reboot) seemed to be
      accessing the database just fine.  At that, I felt a little hope
      return.

      So, I shut down Postgres, then restarted Postgres, then restarted
      the autovacuum utility, and everything worked just peachy.  I'm
      guessing that perhaps the autovacuum tool was trying to connect to
      Postgres while it was replaying the transaction log from not
      having been shut down cleanly, and that's why it choked, but I
      don't know that for sure.


--
| Christopher
+------------------------------------------------+
| Here I stand.  I can do no other.              |
+------------------------------------------------+


Re: Problems restarting after database crashed (signal 11).

From
Tom Lane
Date:
Christopher Cashell <topher-pgsql@zyp.org> writes:
>       [2004-06-30 08:44:53 AM] Failed connection to database template1
>       with error: FATAL:  the database system is starting up
>       .
>       [2004-06-30 08:44:53 AM] Failed connection to database template1
>       with error: FATAL:  the database system is starting up
>       .
>       [2004-06-30 08:44:53 AM] Error: Cannot connect to template1,
>       exiting.

>       So, I shut down Postgres, then restarted Postgres, then restarted
>       the autovacuum utility, and everything worked just peachy.  I'm
>       guessing that perhaps the autovacuum tool was trying to connect to
>       Postgres while it was replaying the transaction log from not
>       having been shut down cleanly, and that's why it choked, but I
>       don't know that for sure.

Yeah, that's what it looks like to me --- autovacuum just a bit too
quick to give up.  You could've just restarted autovacuum once the
database was up.

In 7.5 I think autovacuum will be integrated, and the postmaster won't
bother to start it till the startup sequence is done, so this won't
be an issue.

            regards, tom lane

Re: Problems restarting after database crashed (signal

From
"Matthew T. O'Connor"
Date:
Christopher Cashell wrote:
>   [1] Upon rebooting, the autovacuum utility kind of harfed.  Here is
>       the log entries that it made:
>
>       ---
>       [2004-06-30 08:44:53 AM] Failed connection to database template1
>       with error: FATAL:  the database system is starting up
>       .
>       [2004-06-30 08:44:53 AM] Failed connection to database template1
>       with error: FATAL:  the database system is starting up
>       .
>       [2004-06-30 08:44:53 AM] Error: Cannot connect to template1,
>       exiting.
>       ---
>
>       When I first saw this, the problem seemed to still be there, and I
>       saw "FATAL: the database system is starting up" running across my
>       brain in bright red letters.
>
>       I took a minute to breath, and luckily realized that at least one
>       program (that starts automatically upon reboot) seemed to be
>       accessing the database just fine.  At that, I felt a little hope
>       return.
>
>       So, I shut down Postgres, then restarted Postgres, then restarted
>       the autovacuum utility, and everything worked just peachy.  I'm
>       guessing that perhaps the autovacuum tool was trying to connect to
>       Postgres while it was replaying the transaction log from not
>       having been shut down cleanly, and that's why it choked, but I
>       don't know that for sure.

This is a known issue with the autovacuum contrib module.  This problem
goes away when pg_autovacuum is integrated into the backend.  I
submitted a patch that makes pg_autovacuum a backend process, but so
far, I have yet to get any response from anyone.

Matthew

Re: Problems restarting after database crashed (signal 11).

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Christopher Cashell <topher-pgsql@zyp.org> writes:
> > 28424 postgres  18   0 16804 3044  15m D  0.0  1.6   0:06.72 postmaster
>
> > Note that it does have a process status of 'D', or uninterruptible
> > sleep.  That would explain the unkillable part, though I'm curious how
> > it ended up there.

Is there an NFS server involved? If an NFS server disappears any process
waiting on I/O for it enters disk-wait indefinitely until it reappears.

--
greg

Re: Problems restarting after database crashed (signal 11).

From
Christopher Cashell
Date:
At Sat, 03 Jul 04, Unidentified Flying Banana Greg Stark, said:
> Is there an NFS server involved? If an NFS server disappears any process
> waiting on I/O for it enters disk-wait indefinitely until it reappears.

Nope.  Everything was local on this machine.

> --
> greg

--
| Christopher
+------------------------------------------------+
| Here I stand.  I can do no other.              |
+------------------------------------------------+