Thread: Chasing "signal 11" issues
Since Monday I have been seeing "terminated by signal 11" messages in my 7.4.6 + Slon 1.0.5 system, but only on the master.
I've done a dumpall, initdb and restore, which reduced the frequency, but I still get them 6-8 times a day.
After turning up logging, it seemed to die when querying a very small table (2 rows, 4 columns, 8-character text strings), but selecting from it manually caused no issues. So I took the hit, shut down the system, and swapped out the RAM (following earlier suggestions on this list).
This seemed to work until 7 hours later, when the problem reappeared, and at a higher frequency too.
It is ONLY occurring on the master, not on any of the leaf (replicated) nodes, and seems to be triggered by a few different systems connecting (so no common code base).
Suggestions/help?
"Tass Chapman" <tasseh.postgres@gmail.com> writes: > Since Monday I have been seeing "terminated by signal 11" messages > in my 7.4.6 + Slon 1.0.5 system,. but only on the master This kind of thing is almost always a hardware problem. 'memtest86' is probably a good first step, and see if any of your cooling fans hanve stopped working. -Doug
Douglas McNaught <doug@mcnaught.org> writes:
> "Tass Chapman" <tasseh.postgres@gmail.com> writes:
>> Since Monday I have been seeing "terminated by signal 11" messages
>> in my 7.4.6 + Slon 1.0.5 system, but only on the master
> This kind of thing is almost always a hardware problem. 'memtest86'
> is probably a good first step, and see if any of your cooling fans
> have stopped working.

If nothing about the software or the workload has changed recently, I'd agree with Doug about what to look at. Otherwise ... 7.4.6 is pretty old and we have fixed a number of problems since then. Even if you don't have the energy to migrate to 8.* now, there's very little excuse for not dropping in the latest 7.4 subrelease (7.4.12 I think).

regards, tom lane
On Thu, 2006-03-30 at 07:02, Tass Chapman wrote:
> Since Monday I have been seeing "terminated by signal 11" messages in
> my 7.4.6 + Slon 1.0.5 system, but only on the master
>
> I've done a dumpall, initdb and restore, which reduced the frequency
> but I still get them 6-8 times a day.
>
> After turning up logging it seemed to die when calling a very small
> table (2 rows, 4 columns, 8 char text strings), but manually selecting
> caused no issues, so I then took a hit and shut down the system and
> swapped out the RAM (from earlier list suggestions).
>
> This seemed to work until 7 hours later when the problem
> reappeared, at a higher frequency too.
>
> It is ONLY occurring on the master, not on any of the leaf (replicated)
> nodes, and seems to be triggered by a few different systems connecting
> (so no common code base)

As mentioned earlier, this tends to be caused by hardware. Note that it can be caused by buggy software or corrupted binaries as well.

It is possible that the binaries you're running have become corrupted in some small way. You might want to run md5sum across all the binaries (postgresql, slony, etc...) on the bad and good machines and compare them.

If the problem is in the hardware, and I think it is, it could be anywhere: bad drive, RAID controller, RAID cache, SCSI interface, CPU, memory, and so on. So memtest86 might find the problem if it's mainboard / CPU / memory, but if it's an I/O problem, it won't.

The most common failures are mechanical in nature. I've had machines that were crashing, and all I had to do was reseat the CPU or memory or heat sink and suddenly it was running fine.

However, you need to switch over to your failover machine immediately. Running your main database on what is most likely faulty hardware is a recipe for corruption of your database.
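For the md5sum comparison suggested above, something like the following might save some typing. It is only a rough sketch, not a tested procedure: the helper names (md5_of_file, build_manifest, diff_manifests) are made up for illustration, and the directory in the usage note is an assumption about where the binaries live on your machines.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its MD5."""
    root = Path(root)
    return {
        str(path.relative_to(root)): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(good, bad):
    """Compare two manifests built on the good and the bad machine.

    Returns three sorted lists: files whose checksums differ,
    files present only on the good machine, and files present
    only on the bad machine.
    """
    mismatched = sorted(
        name for name in good.keys() & bad.keys() if good[name] != bad[name]
    )
    only_good = sorted(good.keys() - bad.keys())
    only_bad = sorted(bad.keys() - good.keys())
    return mismatched, only_good, only_bad
```

You would run build_manifest on each machine against the same install directory (for example /usr/local/pgsql/bin, if that is where yours is; adjust for your layout), copy one manifest over, and pass both to diff_manifests. This only makes sense if both machines were installed from the same build; a mismatch is a hint to reinstall the binaries, and a clean result says nothing about the rest of the hardware.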