Thread: More postmaster troubles

More postmaster troubles

From
"Daryl W. Dunbar"
Date:
Hello again,

Thanks again to those who pointed me to the semaphore problem.  I,
unfortunately, have another problem:

Solaris7 on a Sparc20 running 6.4.2.  Occasionally (once or twice a
day) under a very light load, brain-dead child processes begin to
accumulate in my system.  If left unchecked, eventually the parent
process runs out of resources and dies, orphaning all the lost
processes.  (Now that I have solved the semaphore error, it appears
to be the backend limit of 64 processes.)

Here is a snapshot of truss on some of the processes:
# truss -p 5879
semop(259915776, 0xEFFFC560, 1) (sleeping...)
# truss -p 5912
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5915
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5931
semop(259915776, 0xEFFFC280, 1) (sleeping...)
# truss -p 5926
semop(259915776, 0xEFFFC280, 1) (sleeping...)

They all appear to be waiting on a semaphore operation which
apparently never happens.  The number of stalled processes grows
rapidly (it has gone from 12 to 21 while I wrote this e-mail).
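
For anyone not used to reading truss output: the first argument to
semop() is the semaphore id (it should show up in ipcs -s), and the
"sleeping" means the kernel is waiting for some other process to
release that semaphore.  As a point of reference only (this is not
PostgreSQL source), a blocking acquire looks roughly like this in C:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Illustrative sketch, not PostgreSQL code: a System V semaphore
 * "acquire" (P) operation of the kind the stuck backends are sleeping
 * in.  With sem_flg = 0 and nobody ever incrementing the semaphore,
 * semop() blocks forever, which matches the truss output above.
 */
int
acquire_sem(int semid, unsigned short semnum)
{
    struct sembuf op;

    op.sem_num = semnum;    /* which semaphore within the set */
    op.sem_op = -1;         /* wait until the value is > 0, then decrement */
    op.sem_flg = 0;         /* no IPC_NOWAIT, no SEM_UNDO: sleep indefinitely */

    if (semop(semid, &op, 1) < 0)
    {
        perror("semop");
        return -1;
    }
    return 0;
}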

The stalled processes all started between 6:57am PST and 7:18am PST.
Here is what postmaster wrote to the log:
Feb 12 06:56:46 constantinople POSTMASTER: FATAL: pq_putnchar:
fputc() failed: errno=32
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE:  Deadlock
detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR:  WaitOnLock: error
on wakeup - Aborting this transaction
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE:  Deadlock
detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR:  WaitOnLock: error
on wakeup - Aborting this transaction
Feb 12 07:02:18 constantinople POSTMASTER: FATAL: pq_putnchar:
fputc() failed: errno=32
Feb 12 07:02:19 constantinople last message repeated 2 times

Most of the time things just work, but it appears that once
something has gone awry, I experience a spiraling death.

Thoughts?  Suggestions?  Help? :)

DwD
--
Daryl W. Dunbar
http://www.com, Where the Web Begins!
mailto:daryl@www.com



Re: [HACKERS] More postmaster troubles

From
Tatsuo Ishii
Date:
> Solaris7 on a Sparc20 running 6.4.2.  Occasionally (once or twice a
> day) under a very light load, brain-dead child processes begin to
> accumulate in my system.  If left unchecked, eventually the parent
> process runs out of resources and dies, orphaning all the lost
> processes.  (Now that I have solved the semaphore error, it appears
> to be the backend limit of 64 processes.)

Have you installed the following patch? It solves the problem that
occurs when the number of backends reaches MaxBackendId. I'm not sure
whether your problem is related to this, though.

-------------------------------- cut here ---------------------------
*** postgresql-6.4.2/src/backend/postmaster/postmaster.c.orig    Sun Nov 29 10:52:32 1998
--- postgresql-6.4.2/src/backend/postmaster/postmaster.c    Sat Jan  9 18:14:52 1999
***************
*** 238,243 ****
--- 238,244 ----
  static long PostmasterRandom(void);
  static void RandomSalt(char *salt);
  static void SignalChildren(SIGNAL_ARGS);
+ static int CountChildren(void);

  #ifdef CYR_RECODE
  void        GetCharSetByHost(char *, int, char *);
***************
*** 754,764 ****
                   * by the backend.
                   */

!                 if (BackendStartup(port) != STATUS_OK)
!                     PacketSendError(&port->pktInfo,
                                      "Backend startup failed");
!                 else
!                     status = STATUS_ERROR;
              }

              /* Close the connection if required. */
--- 755,771 ----
                   * by the backend.
                   */

!                                 if (CountChildren() < MaxBackendId) {
!                     if (BackendStartup(port) != STATUS_OK)
!                         PacketSendError(&port->pktInfo,
                                      "Backend startup failed");
!                     else {
!                         status = STATUS_ERROR;
!                     }
!                 } else {
!                     PacketSendError(&port->pktInfo,
!                     "There are too many backends");
!                 }
              }

              /* Close the connection if required. */
***************
*** 1617,1620 ****
--- 1624,1655 ----
      }

      return random() ^ random_seed;
+ }
+ 
+ /*
+  * Count up number of children processes.
+  */
+ static int
+ CountChildren(void)
+ {
+     Dlelem       *curr,
+                *next;
+     Backend    *bp;
+     int            mypid = getpid();
+     int    cnt = 0;
+ 
+     curr = DLGetHead(BackendList);
+     while (curr)
+     {
+         next = DLGetSucc(curr);
+         bp = (Backend *) DLE_VAL(curr);
+ 
+         if (bp->pid != mypid)
+         {
+             cnt++;
+         }
+ 
+         curr = next;
+     }
+     return(cnt);
  }



RE: [HACKERS] More postmaster troubles

From
"Daryl W. Dunbar"
Date:
Thank you, Tatsuo-san.  This patch will solve the dying process
problem when I reach MaxBackendId (which I increased from 64 to
128).  However, I do not know what is causing the spiraling death of
the processes in the first place. :(

Is there some place I should be looking for other patches, besides
those listed on www.postgresql.org?

Thank you for your continued help.

DwD

> -----Original Message-----
> From: t-ishii@ext16.sra.co.jp [mailto:t-ishii@ext16.sra.co.jp]
> On Behalf Of Tatsuo Ishii
> Sent: Saturday, February 13, 1999 1:03 AM
> To: Daryl W. Dunbar
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] More postmaster troubles
>
>
> > Solaris7 on a Sparc20 running 6.4.2.  Occasionally (once or twice a
> > day) under a very light load, brain-dead child processes begin to
> > accumulate in my system.  If left unchecked, eventually the parent
> > process runs out of resources and dies, orphaning all the lost
> > processes.  (Now that I have solved the semaphore error, it appears
> > to be the backend limit of 64 processes.)
>
> Have you installed the following patch? It solves the problem that
> occurs when the number of backends reaches MaxBackendId. I'm not sure
> whether your problem is related to this, though.
>
> -------------------------------- cut here ---------------------------
> *** postgresql-6.4.2/src/backend/postmaster/postmaster.c.orig    Sun Nov 29 10:52:32 1998
> --- postgresql-6.4.2/src/backend/postmaster/postmaster.c    Sat Jan  9 18:14:52 1999
> ***************
> *** 238,243 ****
> --- 238,244 ----
>   static long PostmasterRandom(void);
>   static void RandomSalt(char *salt);
>   static void SignalChildren(SIGNAL_ARGS);
> + static int CountChildren(void);
>
>   #ifdef CYR_RECODE
>   void        GetCharSetByHost(char *, int, char *);
> ***************
> *** 754,764 ****
>                    * by the backend.
>                    */
>
> !                 if (BackendStartup(port) != STATUS_OK)
> !                     PacketSendError(&port->pktInfo,
>                                       "Backend startup failed");
> !                 else
> !                     status = STATUS_ERROR;
>               }
>
>               /* Close the connection if required. */
> --- 755,771 ----
>                    * by the backend.
>                    */
>
> !                                 if (CountChildren() < MaxBackendId) {
> !                     if (BackendStartup(port) != STATUS_OK)
> !                         PacketSendError(&port->pktInfo,
>                                       "Backend startup failed");
> !                     else {
> !                         status = STATUS_ERROR;
> !                     }
> !                 } else {
> !                     PacketSendError(&port->pktInfo,
> !                     "There are too many backends");
> !                 }
>               }
>
>               /* Close the connection if required. */
> ***************
> *** 1617,1620 ****
> --- 1624,1655 ----
>       }
>
>       return random() ^ random_seed;
> + }
> +
> + /*
> +  * Count up number of children processes.
> +  */
> + static int
> + CountChildren(void)
> + {
> +     Dlelem       *curr,
> +                *next;
> +     Backend    *bp;
> +     int            mypid = getpid();
> +     int    cnt = 0;
> +
> +     curr = DLGetHead(BackendList);
> +     while (curr)
> +     {
> +         next = DLGetSucc(curr);
> +         bp = (Backend *) DLE_VAL(curr);
> +
> +         if (bp->pid != mypid)
> +         {
> +             cnt++;
> +         }
> +
> +         curr = next;
> +     }
> +     return(cnt);
>   }
>



Re: [HACKERS] More postmaster troubles

From
Tom Lane
Date:
"Daryl W. Dunbar" <daryl@www.com> writes:
> Thank you, Tatsuo-san.  This patch will solve the dying process
> problem when I reach MaxBackendId (which I increased from 64 to
> 128).  However, I do not know what is causing the spiraling death of
> the processes in the first place. :(

Hmm.  I have noticed at least one place in the code where there is an
undocumented hard-wired dependency on MaxBackendId, to wit MAX_PROC_SEMS
in include/storage/proc.h which is set at 128.  Presumably it should be
equal to MaxBackendId (and I intend to fix that soon).  Evidently that
particular bug is not hurting you (yet) but perhaps there are similar
errors elsewhere that kick in sooner.  Do you see the spiraling-death
problem if you run with MaxBackendId at its customary value of 64?
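
If you are bumping MaxBackendId by hand, one cheap way to catch this
sort of mismatch at compile time is a preprocessor guard near the
MAX_PROC_SEMS definition.  This is only a sketch, and it assumes both
names are visible as plain #define constants at the point of the check:

/* Sketch only: fail the build if the semaphore table cannot cover
 * MaxBackendId backends.  Assumes MAX_PROC_SEMS and MaxBackendId are
 * both simple preprocessor constants where this check is placed. */
#if MAX_PROC_SEMS < MaxBackendId
#error "MAX_PROC_SEMS is smaller than MaxBackendId; raise it to match"
#endif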

The log extract you posted before mentions "fputc() failed: errno=32"
which suggests an unexpected client disconnect during a transaction.
I suspect the backend that gets that disconnect is failing to clean up
properly before exiting, and is leaving one or more locks locked.
We don't have enough info yet to track down the cause, but I suggest
we could narrow it down some by seeing whether the problem goes away
with a lower MaxBackendId setting.
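
(errno 32 is EPIPE, i.e. the backend wrote to a socket the client had
already closed.)  The kind of handling I mean, very roughly sketched;
the two cleanup helpers below are placeholder names, not the actual
6.4.2 functions:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

extern void backend_abort_transaction(void);   /* placeholder name */
extern void backend_release_locks(void);       /* placeholder name */

/*
 * Sketch only: what a disconnect-aware output routine might do.  On
 * EPIPE the client is gone, so abort the transaction and release any
 * locks *before* exiting, instead of dying with them still held.
 */
static void
put_byte_to_client(FILE *pfout, int c)
{
    if (fputc(c, pfout) == EOF)
    {
        if (errno == EPIPE)
        {
            backend_abort_transaction();
            backend_release_locks();
            exit(1);
        }
        perror("fputc");
    }
}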

(You might also want to work on making your clients more robust,
but I'd like to see if we can solve the backend bug first ...)
        regards, tom lane


RE: [HACKERS] More postmaster troubles

From
"Daryl W. Dunbar"
Date:
Tom,

To date, I have experienced the problem only with MaxBackendId set to
64.  Today I installed a version of the code with it set to 128 (I
picked that number more or less arbitrarily, but would like to get it
higher).  By the way, I had to tune the kernel again to allow me to
increase MaxBackendId, this time for shared memory (SHMMAX).
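
In case it saves someone else a trip through the manuals, those limits
live in /etc/system on Solaris 7 and take effect after a reboot.  The
values below are examples only, not recommendations; size them for
your own installation:

* /etc/system fragment (example values only)
* Shared memory: this is the SHMMAX limit mentioned above.
set shmsys:shminfo_shmmax=16777216
* Semaphores: the limits behind the earlier semaphore trouble.
set semsys:seminfo_semmni=64
set semsys:seminfo_semmns=512
set semsys:seminfo_semmsl=128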

As for the clients, they are web users via mod_perl/DBI/DBD:Pg.  It
is possible that the user is hitting the stop button right at a time
which hangs the connection (backend), but I have been unable to
reproduce that so far.  That was my first thought on this problem.
The fact that it apparently spirals is disturbing; I highly doubt
there is a user out there hitting the stop button 64 times in a row. :)
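
If anyone wants to try to reproduce it, the closest thing I can think
of to a stop-button hit is starting a query and then slamming the
socket shut without a proper close.  A rough libpq sketch (the
connection string and table name are placeholders):

#include <stdio.h>
#include <unistd.h>
#include "libpq-fe.h"

/*
 * Sketch only: imitate a browser "Stop" by abandoning a connection
 * mid-query.  Placeholder conninfo and table name; adjust to taste.
 */
int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Kick off a query that will still be running when we vanish. */
    if (!PQsendQuery(conn, "SELECT count(*) FROM some_big_table"))
    {
        fprintf(stderr, "PQsendQuery failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Close the socket abruptly instead of calling PQfinish(). */
    close(PQsocket(conn));
    return 0;
}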

Thanks for your help,

DwD

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Saturday, February 13, 1999 3:23 PM
> To: Daryl W. Dunbar
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] More postmaster troubles
>
>
> "Daryl W. Dunbar" <daryl@www.com> writes:
> > Thank you Tatsousan.  This patch will solve the dying process
> > problem when I reach MaxBackendId (which I increased from 64 to
> > 128).  However, I do not know what is causing the
> spiraling death of
> > the processes in the first place. :(
>
> Hmm.  I have noticed at least one place in the code where
> there is an
> undocumented hard-wired dependency on MaxBackendId, to
> wit MAX_PROC_SEMS
> in include/storage/proc.h which is set at 128.
> Presumably it should be
> equal to MaxBackendId (and I intend to fix that soon).
> Evidently that
> particular bug is not hurting you (yet) but perhaps there
> are similar
> errors elsewhere that kick in sooner.  Do you see the
> spiraling-death
> problem if you run with MaxBackendId at its customary value of 64?
>
> The log extract you posted before mentions "fputc()
> failed: errno=32"
> which suggests an unexpected client disconnect during a
> transaction.
> I suspect the backend that gets that disconnect is
> failing to clean up
> properly before exiting, and is leaving one or more locks locked.
> We don't have enough info yet to track down the cause,
> but I suggest
> we could narrow it down some by seeing whether the
> problem goes away
> with a lower MaxBackendId setting.
>
> (You might also want to work on making your clients more robust,
> but I'd like to see if we can solve the backend bug first ...)
>
>             regards, tom lane
>