Re: Server error - Mailing list pgsql-general

From	scott.marlowe
Subject	Re: Server error
Date	May 7, 2003 14:37:08
Msg-id	Pine.LNX.4.33.0305070916090.8765-100000@css120.ihs.com
In response to	Re: Server error (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-general

On Tue, 6 May 2003, Tom Lane wrote:

> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > On Tue, 6 May 2003, Erik Ronström wrote:
> >> I have a plpgsql function which dies strangely very often, with the
> >> message "server closed the connection unexpectedly". The log file says
>
> > Sig 11 means you have bad memory or CPU, about 99.9% of the time.
>
> In my part of the universe, about 99% of the time it means you've found
> a software bug ;-) ... especially if you can create an example case that
> is reproducible on another machine.  Erik, can you wrap up a test case?
> And which PG version are you running, anyway?

Touché.  I think the real issue is whether or not the error stays the
same each time.  If it occurs in the same exact place, then it is usually
code.  But if the sig 11 shows up in different places each time, then it is
likely bad hardware.

Further, getting a sig 11 every time one runs a certain stored proc is
not necessarily the same as getting one in the same exact place in the
stored proc or PostgreSQL code while it's running.

So, it's a good idea to get several traces of the sig 11, and compare
them.  If they aren't happening in the same place each time, then the
hardware should be checked.
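
As a rough illustration of that comparison, here is a minimal Python sketch.
It assumes the traces were saved as text in the usual gdb "bt" layout (lines
starting with #0, #1, and so on); the helper names top_frames and
same_crash_site are made up for this example and are not part of any
PostgreSQL or gdb tooling.

```python
import re

# Matches gdb-style frame lines (an assumed input format), such as
#   "#0  0x0812abcd in some_function (arg=1) at file.c:42"
#   "#1  other_function () at file.c:7"
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)", re.MULTILINE
)


def top_frames(trace, depth=3):
    """Return the function names of the innermost `depth` frames."""
    frames = FRAME_RE.findall(trace)
    return tuple(name for _, name in frames[:depth])


def same_crash_site(traces, depth=3):
    """True if every trace shows the same innermost `depth` functions."""
    sites = {top_frames(t, depth) for t in traces}
    return len(sites) == 1
```

If same_crash_site comes back False across several crashes, that points
toward checking the hardware; if it comes back True, a software bug looks
more likely.  Either way it is only a hint, not proof.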

My point on this is that YOU shouldn't be chasing down these problems
until such time as the user has proven that their hardware is sound.
Since bad hardware is pretty common, and your time is a limited resource,
I really feel that if someone is getting sig 11s, they should be directed
to test their hardware first with something like memtest86 and only after
it passes should they come back to you.  Especially right now when you and
the other developers are working hard to get the 7.4 code ready to go.

The old test for bad hardware, by the way, was to compile the Linux kernel
100 times with a -j <bignum> switch, with bignum set high enough to use
all your memory.  Of course, that was back when 64 megs was a fair bit,
so it wasn't hard to get the machine to use it all.  With bigger and
bigger memory subsystems, bad memory is much more likely to stay hidden
until load increases, then boom, you hit that bad bit and get a sig11.
Hence the need for better hardware testing before chasing the software bug
possibility.
