Re: Server error - Mailing list pgsql-general

From scott.marlowe
Subject Re: Server error
Date
Msg-id Pine.LNX.4.33.0305070916090.8765-100000@css120.ihs.com
Whole thread Raw
In response to Re: Server error  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Tue, 6 May 2003, Tom Lane wrote:

> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > On Tue, 6 May 2003, Erik Ronström wrote:
> >> I have a plpgsql function which dies strangely very often, with the
> >> message "server closed the connection unexpectedly". The log file says
>
> > Sig 11 means you have bad memory or CPU, about 99.9% of the time.
>
> In my part of the universe, about 99% of the time it means you've found
> a software bug ;-) ... especially if you can create an example case that
> is reproducible on another machine.  Erik, can you wrap up a test case?
> And which PG version are you running, anyway?

Touche'  I think the real issue is whether or not the error remains the
same each time, occuring in the same exact place, then it is usually code.
But if the sig 11 shows up in different places each time, then it is
likely bad hardware.

Further, just because one gets a sig11 every time they run a certain
stored proc is not necessarily the same as getting one in the same exact
place of the stored proc or postgresql code while it's running.

So, it's a good idea to get several traces of the sig 11, and compare
them.  If they aren't happening in the same place each time, then the
hardware should be checked.

My point on this is that YOU shouldn't be chasing down these problems
until such time as the user has proven that their hardware is sound.
Since bad hardware is pretty common, and your time is a limited resource,
I really feel that if someone is getting sig 11s, they should be directed
to test their hardware first with something like memtest86 and only after
it passes should they come back to you.  Especially right now when you and
the other developers are working hard to get the 7.4 code ready to go.

The old test for bad hardware, by the way, was to compile the linux kernel
a 100 times with a -j <bignum> switch with bignum set high enough to use
all your memory.  Of course, that was back when 64 megs was a fair bit,
so it wasn't hard to get the machine to use it all.  With bigger and
bigger memory subsystems, bad memory is much more likely to stay hidden
until load increases, then boom, you hit that bad bit and get a sig11.
Hence the need for better hardware testing before chasing the software bug
possibility.


pgsql-general by date:

Previous
From: "Jimmie H. Apsey"
Date:
Subject: Re: Postgres client/server parameters?
Next
From: Dennis Gearon
Date:
Subject: Re: Perl DBI::Pg - Stop button