Thread: Startup death!

Startup death!

From
"Sam Liddicott"
Date:
From time to time on postgres 7.2 or postgres 7.2.1 we hit a state where the maximum number of postgres processes are all stuck in "startup" mode (as "ps -fwwwwu postgres" shows), between them taking all available CPU.
 
The only cure is to do a shutdown (which doesn't work) and then kill -9 one of the stuck-in-startup processes, upon which they all die and it shuts down properly within seconds.
 
We then restart postgres and all is well.
 
The only extra info I have is that under 7.2 (not 7.2.1) after such circumstances, if I then opened a psql process on that DB it would take many seconds (perhaps 10) before psql gave me a prompt.  If I open the DB to many clients before this prompt appears, they all get stuck in startup again, but if I wait until after the prompt then they do not.
In contrast, the 7.2.1 psql client gives the prompt right away, but the first simple query (select * from channelregion; - a few hundred rows) takes maybe 5 seconds the first time.
 
Why are all these processes stuck in startup and taking as much cpu as they can?
 
Sam

Samuel Liddicott
Support Consultant
sam@ananova.com

Direct Dial: +44 (0)113 367 4523
Fax: +44 (0)113 367 4680
Switchboard: +44 (0)113 367 4600

Ananova Limited
Marshall Mill
Marshall Street
Leeds
LS11 9YJ

http://www.ananova.com

Registered Office:
St James Court
Great Park Road
Almondsbury Park
Bradley Stoke
Bristol BS32 4QJ
Registered in England No.2858918


Re: Startup death!

From
Tom Lane
Date:
"Sam Liddicott" <sam.liddicott@ananova.com> writes:
> Why are all these processes stuck in startup and taking as much cpu as they
> can?

You tell us.  Attach to a few of them with gdb and get stack traces.
(It will help if you've built PG with --enable-debug.)

            regards, tom lane
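
The suggestion above (attach with gdb, grab a stack trace) can be scripted so it runs unattended when the problem next occurs. What follows is only a sketch under assumptions: the helper names are made up for illustration, it assumes a gdb recent enough to accept -batch and -ex (the 5.1-era gdb of this thread may not support -ex), and it assumes the coreutils `timeout` command is available to bound the strace run.

```python
import subprocess

def gdb_backtrace_cmd(pid):
    """Command line for a one-shot, non-interactive backtrace of `pid`."""
    # -batch makes gdb exit after the -ex commands run, so it never waits on stdin.
    return ["gdb", "-batch", "-ex", "bt", "-p", str(pid)]

def strace_cmd(pid, seconds):
    """Command line that traces `pid` and stops after `seconds`."""
    # coreutils `timeout` sends SIGTERM to strace when the time is up.
    return ["timeout", str(seconds), "strace", "-p", str(pid)]

def capture(cmd):
    """Run `cmd` with stdin closed and return stdout and stderr as one string."""
    done = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return done.stdout.decode(errors="replace")

# Example (needs permission to attach to the process):
#   print(capture(gdb_backtrace_cmd(31028)))
#   print(capture(strace_cmd(31028, 10)))
```

Running gdb in batch mode with stdin closed is one way to avoid depending on a terminal being attached when the script is started by someone else.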

Re: Startup death!

From
Ericson Smith
Date:
Seems I had this same problem a while back with 7.2.1

We had I/O problems. Our RAID controller driver was acting up. Upgrading
the i20 driver from Redhat finally and definitively solved the problem.

If you check your processlist, you will see that those "startup"
processes are in an Uninterruptible Sleep mode. We ended up having to
hard reboot the machine to shut down Postgresql. After about a week of
this we found out about the driver.

I would love to hear what your solution was, but am almost sure it is
related to a disk i/o issue.
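
To check whether stuck backends really are in uninterruptible sleep, the process state can be read directly on Linux. This is a small sketch, not anything from the original posts: it assumes the Linux /proc/<pid>/stat layout ("pid (comm) state ..."), and the function names and the sample line are made up for illustration. State "D" is uninterruptible sleep, "S" is ordinary sleep, and "R" is running or runnable.

```python
def proc_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

def read_state(pid):
    """Read the current state of a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return proc_state(f.read())

# Hypothetical sample line, truncated after the parent PID field.
sample = "31028 (postmaster) D 23609"
state = proc_state(sample)
```

A backend blocked on disk I/O would be expected to show "D", whereas one that is burning CPU would more likely show "R".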

For others in the list... What does it mean when the Postgresql
processes are in startup mode? What is it supposed to be doing in that
mode?

- Ericson Smith
eric@did-it.com




Re: Startup death!

From
"Sam Liddicott"
Date:

> -----Original Message-----
> From: Ericson Smith [mailto:eric@did-it.com]
> Sent: 18 July 2002 15:34
> To: Tom Lane
> Cc: Postgresql General Mailing List
> Subject: Re: [GENERAL] Startup death!
>
>
> Seems I had this same problem a while back with 7.2.1
>
> We had I/O problems. Our RAID controller driver was acting up. Upgrading
> the i20 driver from Redhat finally and definitively solved the problem.

We're using redhat 7.3 with raid...
When did you get the i20 driver update?  Did you have to say any
magic words?  Is it part of any recent release?  What version do you use
now?
For us, lsmod doesn't show any kind of i20.
We have /dev/hdi20, which is owned by the dev-3.3-4 package, but that package
also has i21, i22 etc.
The descriptions of all the installed packages don't mention i20.

We have unused (no disks) Adaptec AIC7899 and then we actually use a
MegaRAID card.

> I would love to hear what your solution was, but am almost sure it is
> related to a disk i/o issue.

When it next happens we will strace -p and gdb the processes to see what
they are doing.

> For others in the list... What does it mean when the Postgresql
> processes are in startup mode? What is it supposed to be doing in that
> mode?

yeah!

Sam




Re: Startup death!

From
Ericson Smith
Date:
We got the i20 driver update from Adaptec's site, THEN updated RedHat's
kernel using their up2date utility.

Here are the steps:

1. Have your SCSI Raid driver disk ready
2. You need to reinstall RedHat in expert mode so it will *not load* the
default redhat driver for your RAID (this was part of the problem).
3. Insert the SCSI Raid driver when it prompts you
4. Install Linux as necessary
5. As soon as your install is finished, run rhn_register, and up2date to
download the latest kernels for your machine.
6. Install and run Postgres

These are the steps that we used with success.

- Ericson Smith
eric@did-it.com





Re: Startup death!

From
"Sam Liddicott"
Date:
Thank you very much, good advice here!
We will try this,
and may bug you personally (?) if we need clarification, as it doesn't seem
to be a postgres issue.

Sam





Re: Startup death!

From
"Sam Liddicott"
Date:

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 18 July 2002 14:57
> To: Sam Liddicott
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Startup death!
>
>
> "Sam Liddicott" <sam.liddicott@ananova.com> writes:
> > Why are all these processes stuck in startup and taking as much cpu as they
> > can?
>
> You tell us.  Attach to a few of them with gdb and get stack traces.
> (It will help if you've built PG with --enable-debug.)

Here's the output of the script I wrote for support to run when the error
occurs.
It has the ps list, a gdb and an strace for 10 seconds of a stuck process.
I'll fix the gdb hangup error so maybe we get the full stack trace (it
worked in testing!)

UID        PID  PPID  C STIME TTY          TIME CMD
postgres 23609     1  0 Aug08 ?        00:00:58 /usr/bin/postmaster
postgres 23611 23609  0 Aug08 ?        00:01:20 postgres: stats buffer process
postgres 23612 23611  5 Aug08 ?        01:21:06 postgres: stats collector process
postgres 31028 23609  8 04:31 ?        00:01:34 postgres: tv tv [local] startup
postgres 31239 23609  9 04:32 ?        00:01:39 postgres: tv tv 10.30.10.105 startup
postgres 31429 23609  4 04:36 ?        00:00:35 postgres: tv tv 10.30.10.101 startup
postgres 31458 23609  4 04:36 ?        00:00:36 postgres: tv tv 10.30.10.105 startup
postgres 31484 23609  4 04:36 ?        00:00:37 postgres: tv tv 10.30.10.105 startup
postgres 31495 23609  5 04:37 ?        00:00:37 postgres: tv tv 10.30.10.104 startup
postgres 31704 23609  4 04:37 ?        00:00:29 postgres: tv tv 10.30.10.104 startup
postgres 31719 23609  4 04:38 ?        00:00:29 postgres: tv tv 10.30.10.102 startup
postgres 31738 23609  4 04:38 ?        00:00:27 postgres: tv tv 10.30.10.102 startup
postgres 31761 23609  2 04:38 ?        00:00:17 postgres: tv tv 10.30.10.103 startup
postgres 31766 23609  2 04:38 ?        00:00:13 postgres: tv tv 10.30.10.103 startup
postgres 31791 23609  3 04:39 ?        00:00:24 postgres: tv tv 10.30.10.102 startup
postgres 31799 23609  3 04:39 ?        00:00:22 postgres: tv tv 10.30.10.102 startup
postgres 31820 23609  3 04:39 ?        00:00:21 postgres: tv tv 10.30.10.104 startup
postgres 31841 23609  0 04:40 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 31842 23609  0 04:40 ?        00:00:01 postgres: tv tv 10.30.10.105 startup
postgres 31846 23609  0 04:40 ?        00:00:02 postgres: tv tv 10.30.10.105 startup
postgres 31857 23609  0 04:40 ?        00:00:03 postgres: tv tv 10.30.10.102 startup
postgres 31889 23609  0 04:40 ?        00:00:03 postgres: tv tv 10.30.10.104 startup
postgres 31899 23609  0 04:41 ?        00:00:03 postgres: tv tv 10.30.10.102 startup
postgres 31910 23609  0 04:41 ?        00:00:02 postgres: tv tv 10.30.10.105 startup
postgres 31911 23609  0 04:41 ?        00:00:02 postgres: tv tv 10.30.10.103 startup
postgres 31943 23609  0 04:41 ?        00:00:04 postgres: tv tv 10.30.10.101 startup
postgres 31947 23609  0 04:42 ?        00:00:03 postgres: tv tv 10.30.10.102 startup
postgres 32046 23609  0 04:42 ?        00:00:04 postgres: tv tv 10.30.10.101 startup
postgres 32136 23609  0 04:42 ?        00:00:02 postgres: tv tv 10.30.10.105 startup
postgres 32140 23609  0 04:42 ?        00:00:02 postgres: tv tv 10.30.10.105 startup
postgres 32141 23609  0 04:42 ?        00:00:02 postgres: tv tv 10.30.10.102 startup
postgres 32163 23609  1 04:42 ?        00:00:04 postgres: tv tv 10.30.10.102 startup
postgres 32174 23609  0 04:43 ?        00:00:01 postgres: tv tv 10.30.10.102 startup
postgres 32175 23609  0 04:43 ?        00:00:01 postgres: tv tv 10.30.10.103 startup
postgres 32176 23609  0 04:43 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32180 23609  0 04:43 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32188 23609  0 04:43 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32189 23609  0 04:43 ?        00:00:00 postgres: tv tv 10.30.10.104 startup
postgres 32190 23609  0 04:43 ?        00:00:00 postgres: tv tv 10.30.10.104 startup
postgres 32200 23609  0 04:43 ?        00:00:01 postgres: tv tv 10.30.10.105 startup
postgres 32204 23609  0 04:43 ?        00:00:02 postgres: tv tv 10.30.10.105 startup
postgres 32236 23609  0 04:43 ?        00:00:02 postgres: tv tv 10.30.10.101 startup
postgres 32238 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.103 startup
postgres 32242 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32243 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.104 startup
postgres 32249 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32254 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.103 startup
postgres 32258 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.105 startup
postgres 32262 23609  0 04:44 ?        00:00:02 postgres: tv tv 10.30.10.101 startup
postgres 32272 23609  0 04:44 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32296 23609  0 04:45 ?        00:00:02 postgres: tv tv 10.30.10.102 startup
postgres 32344 23609  1 04:45 ?        00:00:03 postgres: tv tv 10.30.10.105 startup
postgres 32361 23609  0 04:45 ?        00:00:01 postgres: tv tv 10.30.10.104 startup
postgres 32365 23609  0 04:45 ?        00:00:02 postgres: tv tv 10.30.10.103 startup
postgres 32395 23609  1 04:45 ?        00:00:02 postgres: tv tv 10.30.10.101 startup
postgres 32410 23609  1 04:46 ?        00:00:02 postgres: tv tv 10.30.10.102 startup
postgres 32431 23609  2 04:46 ?        00:00:03 postgres: tv tv 10.30.10.104 startup
postgres 32435 23609  0 04:46 ?        00:00:01 postgres: tv tv 10.30.10.103 startup
postgres 32436 23609  0 04:46 ?        00:00:01 postgres: tv tv 10.30.10.105 startup
postgres 32462 23609  1 04:47 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32619 23609  2 04:47 ?        00:00:03 postgres: tv tv 10.30.10.103 startup
postgres 32680 23609  1 04:47 ?        00:00:01 postgres: tv tv 10.30.10.101 startup
postgres 32712 23609  1 04:48 ?        00:00:01 postgres: tv tv 10.30.10.104 startup
postgres 32721 23609  1 04:48 ?        00:00:00 postgres: tv tv 10.30.10.102 startup
postgres 32732 23609  0 04:48 ?        00:00:00 postgres: tv tv 10.30.10.105 startup
postgres 32765 23609  1 04:49 ?        00:00:00 postgres: tv tv 10.30.10.101 startup
postgres   304 23609  0 04:49 ?        00:00:00 postgres: tv tv 10.30.10.105 startup
postgres   318 23609  0 04:49 ?        00:00:00 postgres: tv tv 10.30.10.101 startup
postgres   324 23609  0 04:49 ?        00:00:00 postgres: tv tv 10.30.10.104 startup
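
A process list like the one above is easier to read once it is tallied by client address. The following is an illustrative sketch only: the function name is made up, the sample is three entries copied from the listing, and it assumes the backend title has the form "postgres: user db host activity" as shown there. It also rejoins lines that a mail client has wrapped.

```python
from collections import Counter

def startup_backends_by_client(ps_text):
    """Count backends whose activity is 'startup', grouped by client address."""
    # Rejoin wrapped output: a line that does not begin with the UID column
    # is treated as the tail of the previous line.
    joined = []
    for line in ps_text.splitlines():
        if line.startswith("postgres ") or not joined:
            joined.append(line.strip())
        else:
            joined[-1] += " " + line.strip()
    counts = Counter()
    for line in joined:
        fields = line.split()
        if "postgres:" in fields and fields[-1] == "startup":
            counts[fields[-2]] += 1  # the field before the activity is the client
    return counts

sample = """postgres 23609     1  0 Aug08 ?        00:00:58 /usr/bin/postmaster
postgres 31028 23609  8 04:31 ?        00:01:34 postgres: tv tv [local]
startup
postgres 31239 23609  9 04:32 ?        00:01:39 postgres: tv tv 10.30.10.105
startup
"""
counts = startup_backends_by_client(sample)
```

Sorting the result with `counts.most_common()` shows at a glance whether one client host is responsible for most of the stuck connections.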
------
doing 31028

31028 ===============================
GNU gdb Red Hat Linux (5.1.90CVS-5)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux".
Attaching to process 31028
Reading symbols from /usr/bin/postgres...done.
Reading symbols from /lib/libpam.so.0...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /lib/libssl.so.2...done.
Loaded symbols for /lib/libssl.so.2
Reading symbols from /lib/libcrypto.so.2...done.
Loaded symbols for /lib/libcrypto.so.2
Reading symbols from /usr/kerberos/lib/libkrb5.so.3...done.
Loaded symbols for /usr/kerberos/lib/libkrb5.so.3
Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3
Reading symbols from /usr/kerberos/lib/libcom_err.so.3...done.
Loaded symbols for /usr/kerberos/lib/libcom_err.so.3
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/i686/libm.so.6...done.
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /usr/lib/libreadline.so.4...done.
Loaded symbols for /usr/lib/libreadline.so.4
Reading symbols from /lib/libtermcap.so.2...done.
Loaded symbols for /lib/libtermcap.so.2
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /usr/lib/gconv/ISO8859-1.so...done.
Loaded symbols for /usr/lib/gconv/ISO8859-1.so
0x420e8b52 in semop () from /lib/i686/libc.so.6
(gdb) Hangup detected on fd 0
error detected on stdin
Detaching from program: /usr/bin/postgres, process 31028
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3571713, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3538944, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3604482, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3571713, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1)           = 0
semop(3538944, 0xbfffe150, 1------------------------------
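
The strace output above is a long run of semop() calls, and a quick tally shows which semaphore sets are being hit and whether any call returned an error. This is a sketch under assumptions: the function name is made up, the sample is a few lines copied from the trace, and the regular expression only handles the simple "semop(id, addr, n) = ret" form seen here. Truncated lines are skipped rather than guessed at.

```python
import re
from collections import Counter

SEMOP = re.compile(r"^semop\((\d+), (0x[0-9a-f]+), (\d+)\)\s+= (-?\d+)")

def tally_semops(strace_text):
    """Return (calls per semaphore set id, number of calls with nonzero return)."""
    by_semid = Counter()
    failures = 0
    for line in strace_text.splitlines():
        m = SEMOP.match(line.strip())
        if not m:
            continue  # truncated or unrelated line
        by_semid[int(m.group(1))] += 1
        if int(m.group(4)) != 0:
            failures += 1
    return by_semid, failures

sample = """semop(3538944, 0xbfffe150, 1)           = 0
semop(3571713, 0xbfffe170, 1)           = 0
semop(3538944, 0xbfffe1c0, 1)           = 0
semop(3538944, 0xbfffe150, 1"""
by_semid, failures = tally_semops(sample)
```

If every call returns 0, as in the excerpt, the process is completing semaphore operations in a tight loop rather than failing or blocking on one of them, which would be consistent with the CPU usage reported at the start of the thread.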