On Wed, 2004-06-30 at 21:41, Scott Marlowe wrote:
> On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
> > Yesterday, while attempting to access a database, I received errors
> > saying that the database was inaccessible. After investigating a
> > little, I found the following in the PostgreSQL log files:
> >
> > 2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
> > terminated by signal 11
>
> > Eventually I attempted to shut it down and restart it, but that
> > failed too. When I attempted to shut it down, I discovered a hung
> > 'startup subprocess' that can't be killed.
> >
> > nexus:~# ps aux | grep postgres
> > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
> > startup subprocess
> > nexus:~# kill -9 28424
> > nexus:~# ps aux | grep postgres
> > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
> > startup subprocess
> > nexus:~#
>
> The combination of a signal 11 failure and a process stuck in a D state
> makes me lean towards thinking it's bad hardware (CPU or memory). Have
> you tested this machine?
Oh, and a buggy kernel or kernel module somewhere is a possibility as
well. I didn't mean to leave that out; I have had problems with some
kernels under heavy parallel load doing stupid things that look just
like this.
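
For what it's worth, the "D" in the STAT column of the ps output above
means uninterruptible sleep: the process is blocked inside the kernel, so
even kill -9 is not acted on until the kernel call returns. Below is a
minimal sketch for picking such processes out of `ps aux` output. The
function name d_state_procs is made up for illustration, the second sample
row is invented for contrast, and the sketch assumes the usual `ps aux`
column layout where STAT is the 8th field.

```shell
#!/bin/sh
# d_state_procs (hypothetical helper): read `ps aux`-style output on stdin
# and print only the rows whose STAT field starts with "D", i.e. processes
# in uninterruptible sleep. Assumes STAT is column 8, as in `ps aux`.
d_state_procs() {
    awk '$8 ~ /^D/'
}

# First data row is copied from the report above; the second row is an
# invented, ordinarily sleeping process included only for contrast.
sample='USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres: startup subprocess
root 1 0.0 0.1 1500 500 ? S 08:00 0:01 init'

# Only the postgres startup subprocess row should come through the filter.
printf '%s\n' "$sample" | d_state_procs
```

On a live machine the same filter would be fed with `ps aux | d_state_procs`.
If anything shows up there for long, the kernel log (dmesg) is probably the
next place to look, since a process that stays in D state is waiting on
something in the kernel or in the hardware beneath it.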