Thread: Problems restarting after database crashed (signal 11).
Yesterday, while attempting to access a database, I received errors saying that the database was innaccessible. After investigating a little, I found the following in the PostgreSQL log files: 2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was terminated by signal 11 2004-06-30 08:30:19 [24119] LOG: terminating any other active server processes 2004-06-30 08:30:19 [28383] WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the curre nt transaction and exit, because another server process exited abnormally and po ssibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat yo ur command. 2004-06-30 08:30:19 [28362] WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the curre nt transaction and exit, because another server process exited abnormally and po ssibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat yo ur command. The last bit then repeated a few more times, and then: 2004-06-30 08:30:20 [24119] LOG: all server processes terminated; reinitializing 2004-06-30 08:30:20 [28424] LOG: database system was interrupted at 2004-06-30 08:22:23 CDT 2004-06-30 08:30:20 [28424] LOG: checkpoint record is at 8/77703F9C 2004-06-30 08:30:20 [28424] LOG: redo record is at 8/775B1D38; undo record is at 0/0; shutdown FALSE 2004-06-30 08:30:20 [28424] LOG: next transaction ID: 1638554; next OID: 1058492 2004-06-30 08:30:20 [28424] LOG: database system was not properly shut down; automatic recovery in progress 2004-06-30 08:30:20 [28424] LOG: redo starts at 8/775B1D38 2004-06-30 08:30:21 [28430] LOG: connection received: host=[local] port= 2004-06-30 08:30:21 [28430] FATAL: the database system is starting up 2004-06-30 08:30:38 [28424] LOG: record with zero length at 8/78855F38 2004-06-30 08:30:38 [28424] LOG: redo done at 8/78853EE0 2004-06-30 08:31:40 [28449] LOG: connection received: host=[local] port= 2004-06-30 08:31:40 [28449] FATAL: the database system is starting up 2004-06-30 08:31:48 [28452] LOG: connection received: host=[local] port= 2004-06-30 08:31:48 [28452] FATAL: the database system is starting up 2004-06-30 08:31:53 [28459] LOG: connection received: host=[local] port= 2004-06-30 08:31:53 [28459] FATAL: the database system is starting up And this then continues on and on. Even 20 minutes later, attempts to connect to the database were met with the same FATAL error. Eventually I attempted to shut it down and restart it, however that failed too. When I attempted to shut it down, I discovered a hung 'startup subprocess' that can't be killed. nexus:~# ps aux | grep postgres postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: startup subprocess nexus:~# kill -9 28424 nexus:~# ps aux | grep postgres postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: startup subprocess nexus:~# As soon as I can get physical access to the machine, I'm planning to reboot it, as I can't think of anything else to do to kill a process that can't be kill -KILL'ed. I'm worried that attempting to start the database after rebooting will fail in the same way, however. Has anyone seen anything like this before, or have any ideas on how to proceed? I'm running on an Intel Pentium Pro box, with Debian/GNU Linux, running 'unstable'. I'm using PostgreSQL 7.4.3. Thank you for your help. -- | Christopher +------------------------------------------------+ | Here I stand. I can do no other. | +------------------------------------------------+
Christopher Cashell <topher-pgsql@zyp.org> writes: > Eventually I attempted to shut it down and restart it, however that > failed too. When I attempted to shut it down, I discovered a hung > 'startup subprocess' that can't be killed. This is interesting because it seems just about exactly like this recent Red Hat bug report: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126885 As I commented there, I think that it must be a kernel or hardware issue --- Postgres itself can surely not make an unkillable process. However it's common to see processes that don't respond to kill if they are stuck inside a kernel I/O request. That could mean either unresponsive hardware or a kernel bug. I wonder whether you have any similarities in hardware or Linux kernel to the person who filed the above report? regards, tom lane
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said: > Christopher Cashell <topher-pgsql@zyp.org> writes: > > Eventually I attempted to shut it down and restart it, however that > > failed too. When I attempted to shut it down, I discovered a hung > > 'startup subprocess' that can't be killed. > > This is interesting because it seems just about exactly like this > recent Red Hat bug report: > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126885 Hrm. Yes, it does appear to be a very similar, if not identical, issue. > As I commented there, I think that it must be a kernel or hardware > issue --- Postgres itself can surely not make an unkillable process. > However it's common to see processes that don't respond to kill if > they are stuck inside a kernel I/O request. That could mean either > unresponsive hardware or a kernel bug. That is somewhat along the lines of what I was thinking, although I have had no problems like this before. The machine has been running for over 100 days, and the database as well, without issue. 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster Note that it does have a process status of 'D', or uninterruptible sleep. That would explain the unkillable part, though I'm curious how it ended up there. Unless it just happened to be in a really bad spot when Posgres segfaulted. . . although, I wouldn't expect that would affect the 'startup subprocess'. > I wonder whether you have any similarities in hardware or Linux kernel > to the person who filed the above report? Here's all the information I can provide for this machine: IBM IntelliStation Z Pro Model: 6899-12U Dual Pentium Pro 200 192MB RAM 4.5 GB IBM SCSI HDD 9 GB IBM SCSI HDD 6.4 GB WD HDD The database resides on the 4.5 GB SCSI, with the pg_xlog directory symlinked from there, and actually existing on the 9GB SCSI. nexus:~$ uname -a Linux nexus.zyp.org 2.6.4 #1 SMP Thu Mar 11 14:04:49 CST 2004 i686 GNU/Linux nexus:~$ uptime 21:15:39 up 107 days, 20:57, 7 users, load average: 2.04, 2.31, 2.38 If there's any other information I can provide, please let me know. I'm going to reboot the box right now, and cross my fingers, hoping it'll come back up. ;-) > regards, tom lane -- | Christopher +------------------------------------------------+ | Here I stand. I can do no other. | +------------------------------------------------+
Christopher Cashell <topher-pgsql@zyp.org> writes: > 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster > Note that it does have a process status of 'D', or uninterruptible > sleep. That would explain the unkillable part, though I'm curious how > it ended up there. Perhaps I'm just of an older generation, but I always thought 'D' meant "disk I/O wait". Which definitely is uninterruptible on most Unixen. regards, tom lane
On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote: > Yesterday, while attempting to access a database, I received errors > saying that the database was innaccessible. After investigating a > little, I found the following in the PostgreSQL log files: > > 2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was > terminated by signal 11 > Eventually I attempted to shut it down and restart it, however that > failed too. When I attempted to shut it down, I discovered a hung > 'startup subprocess' that can't be killed. > > nexus:~# ps aux | grep postgres > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: > startup subprocess > nexus:~# kill -9 28424 > nexus:~# ps aux | grep postgres > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: > startup subprocess > nexus:~# The combination of a Sig 11 failure and a process stuck in a D state makes me lean towards thinking it's bad hardware (CPU or memory). Have you tested this machine?
On Wed, 2004-06-30 at 21:41, Scott Marlowe wrote: > On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote: > > Yesterday, while attempting to access a database, I received errors > > saying that the database was innaccessible. After investigating a > > little, I found the following in the PostgreSQL log files: > > > > 2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was > > terminated by signal 11 > > > Eventually I attempted to shut it down and restart it, however that > > failed too. When I attempted to shut it down, I discovered a hung > > 'startup subprocess' that can't be killed. > > > > nexus:~# ps aux | grep postgres > > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: > > startup subprocess > > nexus:~# kill -9 28424 > > nexus:~# ps aux | grep postgres > > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: > > startup subprocess > > nexus:~# > > The combination of a Sig 11 failure and a process stuck in a D state > makes me lean towards thinking it's bad hardware (CPU or memory). Have > you tested this machine? Oh, and a possibly buggy kernel or kernel module somewhere as well. Didn't mean to not say it, and have had problems with some kernels under heavy parallel loads doing stupid things that look just like this.
At Wed, 30 Jun 04, Unidentified Flying Banana Scott Marlowe, said: > The combination of a Sig 11 failure and a process stuck in a D state > makes me lean towards thinking it's bad hardware (CPU or memory). Have > you tested this machine? It's possible that it's bad hardware, as the machine is a little long in the tooth. However, at the same time, it is a *very* well tested box. It's been in production use for 4 years, and I have yet to experience a significant hardware issue. Up until this, prior to when I rebooted it a few hours ago over this issue, it had been running for 110 days, with the database running (under approximately the same load) for that whole time. If it *is* a hardware problem, I would have expected it to show up somewhat sooner than this. -- | Christopher +------------------------------------------------+ | Here I stand. I can do no other. | +------------------------------------------------+
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said: > Christopher Cashell <topher-pgsql@zyp.org> writes: > > 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster > > > Note that it does have a process status of 'D', or uninterruptible > > sleep. That would explain the unkillable part, though I'm curious how > > it ended up there. > > Perhaps I'm just of an older generation, but I always thought 'D' meant > "disk I/O wait". Which definitely is uninterruptible on most Unixen. I've always heard it as 'uninteruptable sleep', which I understood to be the generic term description, and that 'disk I/O wait' is the most common cause for it (but not the only possible cause). Either way, I think you're right, and I think that is what happened here. > regards, tom lane -- | Christopher +------------------------------------------------+ | Here I stand. I can do no other. | +------------------------------------------------+
Christopher Cashell <topher-pgsql@zyp.org> writes: > If it *is* a hardware problem, I would have expected it to show up > somewhat sooner than this. Parts do go bad. Sig11 is often an indication of ether bad RAM
Christopher Cashell <topher-pgsql@zyp.org> writes: > If it *is* a hardware problem, I would have expected it to show up > somewhat sooner than this. [sorry, hit Send a bit too early] Parts do go bad. Sig11 is often an indication of either bad RAM or an overheating processor. Have you checked your fans, especially since it's an old machine? -Doug
Christopher Cashell <topher-pgsql@zyp.org> writes: > Either way, I think you're right, and I think that is what happened > here. So did it come up after you rebooted? The other guy wasn't having any luck that way :-( regards, tom lane
At Thu, 01 Jul 04, Unidentified Flying Banana Tom Lane, said: > Christopher Cashell <topher-pgsql@zyp.org> writes: > > Either way, I think you're right, and I think that is what happened > > here. > > So did it come up after you rebooted? The other guy wasn't having > any luck that way :-( Yep. ;-) In a testament to PostgreSQL's robustness, after rebooting the machine, things went mostly[1] well. Postgres started up without issue, programs made their connections to the database, and queries were happily made. Everything looks to be working perfectly now. And luckily, this machine, though it holds critical data, isn't a time/immediate access critical machine. So having it down for a few hours wasn't any kind of a problem, other than for my blood pressure. Thank you Tom, and the rest of you, for your help. It's greatly appreciated. > regards, tom lane [1] Upon rebooting, the autovacuum utility kind of harfed. Here is the log entries that it made: --- [2004-06-30 08:44:53 AM] Failed connection to database template1 with error: FATAL: the database system is starting up . [2004-06-30 08:44:53 AM] Failed connection to database template1 with error: FATAL: the database system is starting up . [2004-06-30 08:44:53 AM] Error: Cannot connect to template1, exiting. --- When I first saw this, the problem seemed to still be there, and I saw "FATAL: the database system is starting up" running across my brain in bright red letters. I took a minute to breath, and luckily realized that at least one program (that starts automatically upon reboot) seemed to be accessing the database just fine. At that, I felt a little hope return. So, I shut down Postgres, then restarted Postgres, then restarted the autovacuum utility, and everything worked just peachy. I'm guessing that perhaps the autovacuum tool was trying to connect to Postgres while it was replaying the transaction log from not having been shut down cleanly, and that's why it choked, but I don't know that for sure. -- | Christopher +------------------------------------------------+ | Here I stand. I can do no other. | +------------------------------------------------+
Christopher Cashell <topher-pgsql@zyp.org> writes: > [2004-06-30 08:44:53 AM] Failed connection to database template1 > with error: FATAL: the database system is starting up > . > [2004-06-30 08:44:53 AM] Failed connection to database template1 > with error: FATAL: the database system is starting up > . > [2004-06-30 08:44:53 AM] Error: Cannot connect to template1, > exiting. > So, I shut down Postgres, then restarted Postgres, then restarted > the autovacuum utility, and everything worked just peachy. I'm > guessing that perhaps the autovacuum tool was trying to connect to > Postgres while it was replaying the transaction log from not > having been shut down cleanly, and that's why it choked, but I > don't know that for sure. Yeah, that's what it looks like to me --- autovacuum just a bit too quick to give up. You could've just restarted autovacuum once the database was up. In 7.5 I think autovacuum will be integrated, and the postmaster won't bother to start it till the startup sequence is done, so this won't be an issue. regards, tom lane
Christopher Cashell wrote: > [1] Upon rebooting, the autovacuum utility kind of harfed. Here is > the log entries that it made: > > --- > [2004-06-30 08:44:53 AM] Failed connection to database template1 > with error: FATAL: the database system is starting up > . > [2004-06-30 08:44:53 AM] Failed connection to database template1 > with error: FATAL: the database system is starting up > . > [2004-06-30 08:44:53 AM] Error: Cannot connect to template1, > exiting. > --- > > When I first saw this, the problem seemed to still be there, and I > saw "FATAL: the database system is starting up" running across my > brain in bright red letters. > > I took a minute to breath, and luckily realized that at least one > program (that starts automatically upon reboot) seemed to be > accessing the database just fine. At that, I felt a little hope > return. > > So, I shut down Postgres, then restarted Postgres, then restarted > the autovacuum utility, and everything worked just peachy. I'm > guessing that perhaps the autovacuum tool was trying to connect to > Postgres while it was replaying the transaction log from not > having been shut down cleanly, and that's why it choked, but I > don't know that for sure. This is a known issue with the autovacuum contrib module. This problem goes away when pg_autovacuum is integrated into the backend. I submitted a patch that makes pg_autovacuum a backend process, but so far, I have yet to get any response from anyone. Matthew
Tom Lane <tgl@sss.pgh.pa.us> writes: > Christopher Cashell <topher-pgsql@zyp.org> writes: > > 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster > > > Note that it does have a process status of 'D', or uninterruptible > > sleep. That would explain the unkillable part, though I'm curious how > > it ended up there. Is there an NFS server involved? If an NFS server disappears any process waiting on I/O for it enters disk-wait indefinitely until it reappears. -- greg
At Sat, 03 Jul 04, Unidentified Flying Banana Greg Stark, said: > Is there an NFS server involved? If an NFS server disappears any process > waiting on I/O for it enters disk-wait indefinitely until it reappears. Nope. Everything was local on this machine. > -- > greg -- | Christopher +------------------------------------------------+ | Here I stand. I can do no other. | +------------------------------------------------+