Thread: More postmaster troubles
Hello again,

Thanks again to those who pointed me to the semaphore problem.  I,
unfortunately, have another problem: Solaris 7 on a SPARC 20 running
6.4.2.  Occasionally (once or twice a day) under a very light load,
brain-dead child processes begin to accumulate in my system.  If left
unchecked, eventually the parent process runs out of resources and
dies, orphaning all the lost processes.  (Now that I have solved the
semaphore error, it appears to be the backend limit of 64 processes.)

Here is a snapshot of truss on some of the processes:

# truss -p 5879
semop(259915776, 0xEFFFC560, 1) (sleeping...)
# truss -p 5912
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5915
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5931
semop(259915776, 0xEFFFC280, 1) (sleeping...)
# truss -p 5926
semop(259915776, 0xEFFFC280, 1) (sleeping...)

They all appear to be waiting on a semaphore operation which
apparently never happens.  The number of stalled processes grows
rapidly (it has gone from 12 to 21 while I wrote this e-mail).  The
stalled processes all started between 6:57am PST and 7:18am PST.
Here is what postmaster wrote to the log:

Feb 12 06:56:46 constantinople POSTMASTER: FATAL: pq_putnchar: fputc() failed: errno=32
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE: Deadlock detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR: WaitOnLock: error on wakeup - Aborting this transaction
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE: Deadlock detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR: WaitOnLock: error on wakeup - Aborting this transaction
Feb 12 07:02:18 constantinople POSTMASTER: FATAL: pq_putnchar: fputc() failed: errno=32
Feb 12 07:02:19 constantinople last message repeated 2 times

Most of the time things just work, but it appears that once something
has gone awry, I experience a spiraling death.

Thoughts?  Suggestions?  Help?  :)

DwD
--
Daryl W. Dunbar
http://www.com, Where the Web Begins!
mailto:daryl@www.com
> Solaris 7 on a SPARC 20 running 6.4.2.  Occasionally (once or twice
> a day) under a very light load, brain-dead child processes begin to
> accumulate in my system.  If left unchecked, eventually the parent
> process runs out of resources and dies, orphaning all the lost
> processes.  (Now that I have solved the semaphore error, it appears
> to be the backend limit of 64 processes.)

Have you installed the following patch?  It solves the problem when
the number of backends reaches MaxBackendId.  I'm not sure if your
problem relates to this, though.

-------------------------------- cut here ---------------------------
*** postgresql-6.4.2/src/backend/postmaster/postmaster.c.orig	Sun Nov 29 10:52:32 1998
--- postgresql-6.4.2/src/backend/postmaster/postmaster.c	Sat Jan  9 18:14:52 1999
***************
*** 238,243 ****
--- 238,244 ----
  static long PostmasterRandom(void);
  static void RandomSalt(char *salt);
  static void SignalChildren(SIGNAL_ARGS);
+ static int	CountChildren(void);

  #ifdef CYR_RECODE
  void		GetCharSetByHost(char *, int, char *);
***************
*** 754,764 ****
  				 * by the backend.
  				 */

! 				if (BackendStartup(port) != STATUS_OK)
! 					PacketSendError(&port->pktInfo,
  									"Backend startup failed");
! 				else
! 					status = STATUS_ERROR;
  			}

  			/* Close the connection if required. */
--- 755,771 ----
  				 * by the backend.
  				 */

! 				if (CountChildren() < MaxBackendId) {
! 					if (BackendStartup(port) != STATUS_OK)
! 						PacketSendError(&port->pktInfo,
  									"Backend startup failed");
! 					else {
! 						status = STATUS_ERROR;
! 					}
! 				} else {
! 					PacketSendError(&port->pktInfo,
! 						"There are too many backends");
! 				}
  			}

  			/* Close the connection if required. */
***************
*** 1617,1620 ****
--- 1624,1655 ----
  	}

  	return random() ^ random_seed;
+ }
+
+ /*
+  * Count up number of children processes.
+  */
+ static int
+ CountChildren(void)
+ {
+ 	Dlelem	   *curr,
+ 			   *next;
+ 	Backend    *bp;
+ 	int			mypid = getpid();
+ 	int			cnt = 0;
+
+ 	curr = DLGetHead(BackendList);
+ 	while (curr)
+ 	{
+ 		next = DLGetSucc(curr);
+ 		bp = (Backend *) DLE_VAL(curr);
+
+ 		if (bp->pid != mypid)
+ 		{
+ 			cnt++;
+ 		}
+
+ 		curr = next;
+ 	}
+ 	return(cnt);
  }
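With this patch applied, a client that connects while all backend
slots are in use should simply see a failed connection attempt rather
than contributing to the postmaster's death.  A minimal libpq sketch
of that client-side view follows; the connection string is made up
for illustration, and only standard libpq calls (PQconnectdb,
PQstatus, PQerrorMessage, PQfinish) are assumed.

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
	/* connection parameters are made up for illustration */
	PGconn	   *conn = PQconnectdb("host=constantinople dbname=test");

	if (PQstatus(conn) == CONNECTION_BAD)
	{
		/*
		 * With the patch, a server that is out of backend slots should
		 * report "There are too many backends" in the error message.
		 */
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		exit(1);
	}

	/* ... normal queries would go here ... */

	PQfinish(conn);
	return 0;
}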
Thank you, Tatsuo-san.  This patch will solve the dying process
problem when I reach MaxBackendId (which I increased from 64 to 128).
However, I do not know what is causing the spiraling death of the
processes in the first place. :(

Is there some place I should be looking for other patches, besides
those listed on www.postgresql.org?

Thank you for your continued help.

DwD

> -----Original Message-----
> From: t-ishii@ext16.sra.co.jp [mailto:t-ishii@ext16.sra.co.jp] On
> Behalf Of Tatsuo Ishii
> Sent: Saturday, February 13, 1999 1:03 AM
> To: Daryl W. Dunbar
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] More postmaster troubles
>
> > Solaris 7 on a SPARC 20 running 6.4.2.  Occasionally (once or
> > twice a day) under a very light load, brain-dead child processes
> > begin to accumulate in my system.  If left unchecked, eventually
> > the parent process runs out of resources and dies, orphaning all
> > the lost processes.  (Now that I have solved the semaphore error,
> > it appears to be the backend limit of 64 processes.)
>
> Have you installed the following patch?  It solves the problem when
> the number of backends reaches MaxBackendId.  I'm not sure if your
> problem relates to this, though.
>
> [patch snipped]
"Daryl W. Dunbar" <daryl@www.com> writes: > Thank you Tatsousan. This patch will solve the dying process > problem when I reach MaxBackendId (which I increased from 64 to > 128). However, I do not know what is causing the spiraling death of > the processes in the first place. :( Hmm. I have noticed at least one place in the code where there is an undocumented hard-wired dependency on MaxBackendId, to wit MAX_PROC_SEMS in include/storage/proc.h which is set at 128. Presumably it should be equal to MaxBackendId (and I intend to fix that soon). Evidently that particular bug is not hurting you (yet) but perhaps there are similar errors elsewhere that kick in sooner. Do you see the spiraling-death problem if you run with MaxBackendId at its customary value of 64? The log extract you posted before mentions "fputc() failed: errno=32" which suggests an unexpected client disconnect during a transaction. I suspect the backend that gets that disconnect is failing to clean up properly before exiting, and is leaving one or more locks locked. We don't have enough info yet to track down the cause, but I suggest we could narrow it down some by seeing whether the problem goes away with a lower MaxBackendId setting. (You might also want to work on making your clients more robust, but I'd like to see if we can solve the backend bug first ...) regards, tom lane
Tom,

I have to date experienced the problem only with MaxBackendId set to
64.  Today I installed a version of the code with it set to 128 (a
number I picked arbitrarily, but I would like to get it higher).  By
the way, I had to tune the kernel to allow me to increase
MaxBackendId, this time in shared memory (SHMMAX).

As for the clients, they are web users via mod_perl/DBI/DBD::Pg.  It
is possible that a user is hitting the stop button at just the right
moment to hang the connection (backend), but I have been unable to
reproduce that so far.  That was my first thought on this problem.
The fact that it apparently spirals is disturbing; I highly doubt
there is a user out there hitting the stop button 64 times in a row.
:)

Thanks for your help,

DwD

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Saturday, February 13, 1999 3:23 PM
> To: Daryl W. Dunbar
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] More postmaster troubles
>
> "Daryl W. Dunbar" <daryl@www.com> writes:
> > Thank you, Tatsuo-san.  This patch will solve the dying process
> > problem when I reach MaxBackendId (which I increased from 64 to
> > 128).  However, I do not know what is causing the spiraling death
> > of the processes in the first place. :(
>
> Hmm.  I have noticed at least one place in the code where there is
> an undocumented hard-wired dependency on MaxBackendId, to wit
> MAX_PROC_SEMS in include/storage/proc.h, which is set at 128.
> Presumably it should be equal to MaxBackendId (and I intend to fix
> that soon).  Evidently that particular bug is not hurting you (yet),
> but perhaps there are similar errors elsewhere that kick in sooner.
> Do you see the spiraling-death problem if you run with MaxBackendId
> at its customary value of 64?
>
> The log extract you posted before mentions "fputc() failed:
> errno=32", which suggests an unexpected client disconnect during a
> transaction.  I suspect the backend that gets that disconnect is
> failing to clean up properly before exiting, and is leaving one or
> more locks locked.  We don't have enough info yet to track down the
> cause, but I suggest we could narrow it down some by seeing whether
> the problem goes away with a lower MaxBackendId setting.
>
> (You might also want to work on making your clients more robust,
> but I'd like to see if we can solve the backend bug first ...)
>
> 			regards, tom lane
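On the SHMMAX point: the symptom when the kernel limit is too low is
that the postmaster's shared memory request fails at startup.  A
minimal standalone sketch of that failure mode follows; the 4 MB
request size is made up, and the real size depends on MaxBackendId
and the buffer settings.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int
main(void)
{
	size_t	request = 4 * 1024 * 1024;	/* hypothetical segment size */
	int		shmid = shmget(IPC_PRIVATE, request, IPC_CREAT | 0600);

	if (shmid < 0)
	{
		/* EINVAL here usually means the request exceeds SHMMAX */
		fprintf(stderr, "shmget of %lu bytes failed: %s\n",
				(unsigned long) request, strerror(errno));
		return 1;
	}

	printf("got segment %d, so SHMMAX is at least %lu bytes\n",
		   shmid, (unsigned long) request);
	shmctl(shmid, IPC_RMID, NULL);		/* remove the demo segment */
	return 0;
}

On Solaris 7 the limit is normally raised with a line such as
"set shmsys:shminfo_shmmax=..." in /etc/system, followed by a reboot.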