Major bug, possible, with Solaris 7? - Mailing list pgsql-hackers

From The Hermit Hacker
Subject Major bug, possible, with Solaris 7?
Date
Msg-id Pine.BSF.4.05.9902191343090.10574-100000@thelab.hub.org
Responses RE: [HACKERS] Major bug, possible, with Solaris 7?
List pgsql-hackers
Can someone please take a minute to look at this?

I've gzip'd and moved his errorlog to
ftp.postgresql.org:/pub/debugging...one thing that appears to be
lacking...what version of PostgreSQL are you using?

Marc G. Fournier                                
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 

---------- Forwarded message ----------
Date: Thu, 18 Feb 1999 18:23:25 -0500
From: Daryl W. Dunbar <daryl@www.com>
To: The Hermit Hacker <scrappy@hub.org>
Subject: RE: Interested?

Thanks Marc,  We exchanged an e-mail or two last week, along with
Tatsuo Ishii and Tom Lane.  You suggested I truss the process.

Anyway, periodically, the backends spiral out of control with hung
up children until I hit MaxBackendID (which I compiled in to be
128).  Initially, I was running out of semaphores on Solaris 7 and
changed /etc/system to add these lines:
set shmsys:shminfo_shmmax=16777216
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=128
set shmsys:shminfo_shmseg=51
*
set semsys:seminfo_semmap=128
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=8192
set semsys:seminfo_semmnu=8192
set semsys:seminfo_semmsl=64
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=32

I increased shared memory so I could start more backends...
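
As a quick sanity check on the numbers above, here is a minimal Python sketch that parses the quoted /etc/system lines and prints two derived figures. It assumes the usual System V IPC meanings of the tunables (semmni = semaphore sets, semmsl = semaphores per set, semmns = semaphores system-wide, shmmax = largest shared memory segment in bytes). The helper name parse_settings is made up for illustration and is not part of Solaris or PostgreSQL.

```python
import re

# The tuning lines quoted in the message, copied verbatim.
ETC_SYSTEM = """\
set shmsys:shminfo_shmmax=16777216
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=128
set shmsys:shminfo_shmseg=51
*
set semsys:seminfo_semmap=128
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=8192
set semsys:seminfo_semmnu=8192
set semsys:seminfo_semmsl=64
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=32
"""


def parse_settings(text):
    """Hypothetical helper: map (module, name) to int for 'set module:name=value' lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # A line starting with '*' is a comment in /etc/system.
        if not line or line.startswith("*"):
            continue
        match = re.fullmatch(r"set\s+(\w+):(\w+)\s*=\s*(\d+)", line)
        if match:
            module, name, value = match.groups()
            settings[(module, name)] = int(value)
    return settings


settings = parse_settings(ETC_SYSTEM)
# Strip the 'seminfo_' / 'shminfo_' prefixes to get the short tunable names.
sem = {n.replace("seminfo_", ""): v for (m, n), v in settings.items() if m == "semsys"}
shm = {n.replace("shminfo_", ""): v for (m, n), v in settings.items() if m == "shmsys"}

shmmax_mib = shm["shmmax"] / (1024 * 1024)
sem_capacity = sem["semmni"] * sem["semmsl"]  # sets * semaphores per set

print(f"shmmax = {shmmax_mib:.0f} MiB")
print(f"semmni * semmsl = {sem_capacity}, semmns = {sem['semmns']}")
```

With these values the largest shared memory segment is 16 MiB, and the product of semaphore sets and semaphores per set equals the system-wide limit (128 * 64 = 8192), so the semaphore settings are at least mutually consistent.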

OK, so now, everything is running fine and boom, the backends start
to hang on semop, eventually reaching MaxBackendID and refusing
connections.
Attached is a log file from a hang up today.  Debug is set to 3.
All times are PST.  I have carved out a bunch of normal operation
from the beginning (about 21,000 lines) and redundant 'too many
backends' (about 1,000 lines, while I was eating lunch :) signified
by {SNIP SNIP}.  I pick the log back up with the birth of pid 2828
and left several 'normal' cycles in until...

You can see that process 2840 is the first child to hang.  It was
started at 11:39:23 and did not die until sent a 15 by the parent at
14:12:16.  All of the hung processes fall between 2840 and 3454.

Sorry the file is so big.  Here are some 'keys' you can use:
Startup is the first line (obviously).
You can find child startup by looking for [2840] (pid in brackets)
You can find child exits by looking for '2840 exited'
You can find where I send the kill signal by looking for 'pmdie 15'
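
The search keys above could be applied mechanically. The following is only a sketch in Python: it assumes nothing about the log beyond what the keys say (a pid in square brackets, the text "<pid> exited", and the string "pmdie 15"), and the function names scan_log and still_running_at_kill are made up for illustration. The real postmaster debug output may need different patterns.

```python
import re

# Assumed patterns, based only on the search keys listed in the message:
#   "[<pid>]"       a line mentioning a child pid in brackets
#   "<pid> exited"  a line reporting that the child exited
#   "pmdie 15"      the point where the kill signal was sent
PID_IN_BRACKETS = re.compile(r"\[(\d+)\]")
PID_EXITED = re.compile(r"\b(\d+) exited")
KILL_MARKER = "pmdie 15"


def scan_log(lines):
    """Return (started, exited, kills) for an iterable of log lines."""
    started = {}  # pid -> first line number where "[pid]" appears
    exited = {}   # pid -> last line number where "pid exited" appears
    kills = []    # line numbers that contain the kill marker
    for lineno, line in enumerate(lines, start=1):
        for match in PID_IN_BRACKETS.finditer(line):
            started.setdefault(int(match.group(1)), lineno)
        for match in PID_EXITED.finditer(line):
            exited[int(match.group(1))] = lineno
        if KILL_MARKER in line:
            kills.append(lineno)
    return started, exited, kills


def still_running_at_kill(started, exited, kills):
    """Pids first seen before the first kill marker that had not exited by then."""
    if not kills:
        return []
    first_kill = kills[0]
    return sorted(
        pid
        for pid, start in started.items()
        if start < first_kill and exited.get(pid, first_kill + 1) > first_kill
    )
```

Passing an open file object for the uncompressed log to scan_log, then the three results to still_running_at_kill, would list the candidate hung children; per the description above, those should fall between 2840 and 3454.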

I think that's a good start. :)

Don't hesitate to contact me if I can shed any more light.  I'm wide
open to ideas at the moment.  I'm in EST, but tend to work until
10-11 at night, so e-mail anytime.

Thanks,

DwD

> -----Original Message-----
> From: The Hermit Hacker [mailto:scrappy@hub.org]
> Sent: Thursday, February 18, 1999 5:36 PM
> To: Daryl W. Dunbar
> Subject: Re: Interested?
>
>
>
> Hi Daryl...
>
>     I'm not the strongest at internal code, so may not be of any help
> at all.  I just went through my -hackers email, and can't seem to find
> anything from you in there.  Can you tell me what your problem is, as
> well as version of PostgreSQL you are using, and we'll see what we can
> do?
>
> Marc
>
> On Thu, 18 Feb 1999, Daryl W. Dunbar wrote:
>
> > Marc,
> >
> > I know that you put considerable volunteer time into PostgreSQL.  If
> > I am not too bold in asking, and you are comfortable with it, I am
> > prepared to compensate you for your time if you can assist me in
> > tracking down this rather nasty bug I have been e-mailing Hackers
> > about.  Please let me know if you are interested and if so, at what
> > rate.
> >
> > We are in the process of launching a pretty exciting site and a
> > database is an integral part of it.  I really want to use PostgreSQL,
> > but cannot take it into production on Solaris with this problem
> > going on.  I'm in the process of installing a test site on Linux to
> > see if the problem exists there, but I expect it is limited to
> > Solaris.
> >
> > I anxiously await your response.
> >
> > Thanks,
> >
> > DwD
> >
> > --
> > Daryl W. Dunbar
> > VP of Engineering/Chief Technology Officer
> > http://www.com, Where the Web Begins!
> > mailto:daryl@www.com
> >
> >
>
> Marc G. Fournier
> Systems Administrator @ hub.org
> primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
>


