RE: [HACKERS] Major bug, possible, with Solaris 7? - Mailing list pgsql-hackers

From Daryl W. Dunbar
Subject RE: [HACKERS] Major bug, possible, with Solaris 7?
Date
Msg-id 002201be5c58$8928c320$1445e59b@ddunbar.eni.net
In response to Major bug, possible, with Solaris 7?  (The Hermit Hacker <scrappy@hub.org>)
Responses RE: [HACKERS] Major bug, possible, with Solaris 7?
List pgsql-hackers
Oh, sorry.  It's 6.4.2 with a backend patch that keeps the parent from
dying when MaxBackendID is reached.

I know it is in semop() because I did a truss on the child
processes.  From a small sample, it looks like they may all be
trying to operate on the same semaphore.  I'm recompiling with
the -g flag to gain more insight...

DwD

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of The Hermit
> Hacker
> Sent: Friday, February 19, 1999 12:46 PM
> To: pgsql-hackers@postgreSQL.org
> Cc: Daryl W. Dunbar
> Subject: [HACKERS] Major bug, possible, with Solaris 7?
>
>
>
> Can someone please take a minute to look at this?
>
> I've gzip'd and moved his errorlog to
> ftp.postgresql.org:/pub/debugging...one thing that appears to be
> lacking...what version of PostgreSQL are you using?
>
> Marc G. Fournier
> Systems Administrator @ hub.org
> primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
>
> ---------- Forwarded message ----------
> Date: Thu, 18 Feb 1999 18:23:25 -0500
> From: Daryl W. Dunbar <daryl@www.com>
> To: The Hermit Hacker <scrappy@hub.org>
> Subject: RE: Interested?
>
> Thanks Marc,  We exchanged an e-mail or two last week, along with
> Tatsuo Ishii and Tom Lane.  You suggested I truss the process.
>
> Anyway, periodically, the backends spiral out of control with hung
> up children until I hit MaxBackendID (which I compiled in to be
> 128).  Initially, I was running out of semaphores on Solaris 7 and
> changed /etc/system to add these lines:
> set shmsys:shminfo_shmmax=16777216
> set shmsys:shminfo_shmmin=1
> set shmsys:shminfo_shmmni=128
> set shmsys:shminfo_shmseg=51
> *
> set semsys:seminfo_semmap=128
> set semsys:seminfo_semmni=128
> set semsys:seminfo_semmns=8192
> set semsys:seminfo_semmnu=8192
> set semsys:seminfo_semmsl=64
> set semsys:seminfo_semopm=32
> set semsys:seminfo_semume=32
>
> I increased shared memory so I could start more backends...
>
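
The semaphore tunables listed above can be cross-checked with simple arithmetic. This is a minimal sketch, assuming the usual meaning of the Solaris parameters (semmni = number of semaphore sets, semmsl = semaphores per set, semmns = semaphores system-wide). The helper name parse_tunables is hypothetical, and the values are copied from the message.

```python
import re

# Tunables copied from the message (Solaris /etc/system syntax).
ETC_SYSTEM = """\
set shmsys:shminfo_shmmax=16777216
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=128
set shmsys:shminfo_shmseg=51
*
set semsys:seminfo_semmap=128
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=8192
set semsys:seminfo_semmnu=8192
set semsys:seminfo_semmsl=64
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=32
"""

SET_LINE = re.compile(r"^set\s+(\w+):(\w+)\s*=\s*(\d+)$")


def parse_tunables(text):
    """Return {(module, name): int} for each 'set module:name=value' line."""
    tunables = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '*' (comments) are skipped.
        if not line or line.startswith("*"):
            continue
        match = SET_LINE.match(line)
        if match:
            module, name, value = match.groups()
            tunables[(module, name)] = int(value)
    return tunables


tunables = parse_tunables(ETC_SYSTEM)
semmni = tunables[("semsys", "seminfo_semmni")]
semmsl = tunables[("semsys", "seminfo_semmsl")]
semmns = tunables[("semsys", "seminfo_semmns")]
shmmax = tunables[("shmsys", "shminfo_shmmax")]

# Upper bound on semaphores if every set were allocated at full size.
max_by_sets = semmni * semmsl

print("semmni * semmsl =", max_by_sets)
print("semmns          =", semmns)
print("shmmax in MiB   =", shmmax // (1024 * 1024))
```

With these values the product of sets and semaphores per set equals the system-wide limit, so the three semaphore settings are at least consistent with each other.
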
> OK, so now, everything is running fine and boom, the backends start
> to hang on semop, eventually reaching MaxBackendID and refusing
> connections.  Attached is a log file from a hang up today.  Debug is
> set to 3.  All times are PST.  I have carved out a bunch of normal
> operation from the beginning (about 21,000 lines) and redundant 'too
> many backends' (about 1,000 lines, while I was eating lunch :)
> signified by {SNIP SNIP}.  I pick the log back up with the birth of
> pid 2828 and left several 'normal' cycles in until...
>
> You can see that process 2840 is the first child to hang.  It was
> started at 11:39:23 and did not die until sent a 15 by the parent at
> 14:12:16.  All of the hung processes fall between 2840 and 3454.
>
> Sorry the file is so big.  Here are some 'keys' you can use:
> Startup is the first line (obviously).
> You can find child startup by looking for [2840] (pid in brackets)
> You can find child exits by looking for '2840 exited'
> You can find where I send the kill signal by looking for 'pmdie 15'
>
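
The search keys above can be turned into a small scan that lists children which started but never exited. This is a minimal sketch in Python. It assumes only the two patterns described in the message (a pid in brackets for activity, "<pid> exited" for an exit). The sample lines are hypothetical and are not taken from the actual log, and the function name unfinished_pids is made up for illustration.

```python
import re

STARTED = re.compile(r"\[(\d+)\]")      # pid in brackets
EXITED = re.compile(r"\b(\d+) exited")  # "<pid> exited"


def unfinished_pids(lines):
    """Return pids seen in brackets that never show up as '<pid> exited'."""
    seen = []       # pids in the order they first appear
    exited = set()
    for line in lines:
        for pid in STARTED.findall(line):
            if pid not in seen:
                seen.append(pid)
        for pid in EXITED.findall(line):
            exited.add(pid)
    return [int(pid) for pid in seen if pid not in exited]


# Hypothetical log excerpt; the real log format may differ.
sample = [
    "[2828] backend started",
    "child 2828 exited with status 0",
    "[2840] backend started",
    "[2841] backend started",
    "child 2841 exited with status 0",
]

print(unfinished_pids(sample))
```

Run against the full log, a scan like this should report the hung children as the pids that have a bracketed entry but no matching exit line.
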
> I think that's a good start. :)
>
> Don't hesitate to contact me if I can shed any more light.  I'm wide
> open to ideas at the moment.  I'm in EST, but tend to work until
> 10-11 at night, so e-mail anytime.
>
> Thanks,
>
> DwD
>
> > -----Original Message-----
> > From: The Hermit Hacker [mailto:scrappy@hub.org]
> > Sent: Thursday, February 18, 1999 5:36 PM
> > To: Daryl W. Dunbar
> > Subject: Re: Interested?
> >
> >
> >
> > Hi Daryl...
> >
> >     I'm not the strongest at internal code, so may not be of any
> > help at all.  I just went through my -hackers email, and can't seem
> > to find anything from you in there.  Can you tell me what your
> > problem is, as well as version of PostgreSQL you are using, and
> > we'll see what we can do?
> >
> > Marc
> >
> > On Thu, 18 Feb 1999, Daryl W. Dunbar wrote:
> >
> > > Marc,
> > >
> > > I know that you put considerable volunteer time into PostgreSQL.
> > > If I am not too bold in asking, and you are comfortable with it,
> > > I am prepared to compensate you for your time if you can assist
> > > me in tracking down this rather nasty bug I have been e-mailing
> > > Hackers about.  Please let me know if you are interested and if
> > > so, at what rate.
> > >
> > > We are in the process of launching a pretty exciting site and a
> > > database is an integral part of it.  I really want to use
> > > PostgreSQL, but cannot take it into production on Solaris with
> > > this problem going on.  I'm in the process of installing a test
> > > site on Linux to see if the problem exists there, but I expect it
> > > is limited to Solaris.
> > >
> > > I anxiously await your response.
> > >
> > > Thanks,
> > >
> > > DwD
> > >
> > > --
> > > Daryl W. Dunbar
> > > VP of Engineering/Chief Technology Officer
> > > http://www.com, Where the Web Begins!
> > > mailto:daryl@www.com
> > >
> > >
> >
> > Marc G. Fournier
> > Systems Administrator @ hub.org
> > primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
> >
>
>


