Re: Chasing "signal 11" issues - Mailing list pgsql-general

From Scott Marlowe
Subject Re: Chasing "signal 11" issues
Date
Msg-id 1143732927.26940.11.camel@state.g2switchworks.com
In response to Chasing "signal 11" issues  ("Tass Chapman" <tasseh.postgres@gmail.com>)
List pgsql-general
On Thu, 2006-03-30 at 07:02, Tass Chapman wrote:
> Since Monday I have been seeing "terminated by signal 11" messages in
> my 7.4.6 + Slon 1.0.5 system, but only on the master.
>
> I've done a dumpall, initdb and restore, which reduced the frequency,
> but I still get them 6-8 times a day.
>
> After turning up logging it seemed to die when calling a very small
> table (2 rows, 4 columns, 8 char text strings), but manually selecting
> caused no issues, so I then took a hit and shutdown the system and
> swapped out the RAM (from earlier list suggestions).
>
> This seemed to work until 7 hours later, when the problem
> reappeared, at a higher frequency too.
>
> It is ONLY occurring on the master, not on any of the leaf (replicated)
> nodes, and seems to be triggered by a few different systems connecting
> (so no common code base).

As mentioned earlier, this tends to be caused by hardware.  Note that it
can be caused by buggy software or corrupted binaries as well.

It is possible that the binaries you're running have become corrupted
in some small way.  You might want to run md5sum across all the binaries
(postgresql, slony, etc...) on the bad and good machines and compare
them.
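
If running md5sum by hand across two machines gets tedious, the same
comparison can be scripted.  The following is only a rough sketch, not
something from the original setup: the function names are made up for
illustration, and it assumes you point it at whatever directory holds
your postgresql and slony binaries on each box.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each regular file under root (by relative path) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):
                manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def diff_manifests(good, bad):
    """Return the sorted relative paths that are missing or differ."""
    return sorted(
        path for path in set(good) | set(bad) if good.get(path) != bad.get(path)
    )
```

Run hash_tree on the good and the bad machine, save each result (as
JSON, for instance), and feed both to diff_manifests.  Anything it lists
is a file worth reinstalling or inspecting.  Note that this only makes
sense if both machines are supposed to have identical builds.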

If the problem is in the hardware, and I think it is, it could be
anywhere: bad drive, RAID controller, RAID cache, SCSI interface, CPU,
memory, and so on.  So memtest86 might find the problem if it's mainboard
/ CPU / memory, but if it's an I/O problem, it won't.

The most common failures are mechanical in nature.  I've had machines
that were crashing, and all I had to do was reseat the CPU or memory or
heat sink and suddenly it was running fine.

However, you need to switch over to your failover machine immediately.
Running your main database on what is most likely faulty hardware is a
recipe for corruption of your database.
