Thread: More postmaster troubles
Hello again,

Thanks again to those who pointed me to the semaphore problem.  I,
unfortunately, have another problem: Solaris 7 on a SPARC 20 running
6.4.2.  Occasionally (once or twice a day) under a very light load,
brain-dead child processes begin to accumulate in my system.  If left
unchecked, eventually the parent process runs out of resources and
dies, orphaning all the lost processes.  (Now that I have solved the
semaphore error, it appears to be the backend limit of 64 processes.)

Here is a snapshot of truss on some of the processes:

# truss -p 5879
semop(259915776, 0xEFFFC560, 1) (sleeping...)
# truss -p 5912
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5915
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5931
semop(259915776, 0xEFFFC280, 1) (sleeping...)
# truss -p 5926
semop(259915776, 0xEFFFC280, 1) (sleeping...)

They all appear to be waiting on a semaphore operation which
apparently never happens.  The number of stalled processes grows
rapidly (it has gone from 12 to 21 while I wrote this e-mail).  The
stalled processes all started between 6:57am PST and 7:18am PST.
Here is what postmaster wrote to the log:

Feb 12 06:56:46 constantinople POSTMASTER: FATAL: pq_putnchar: fputc() failed: errno=32
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE: Deadlock detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR: WaitOnLock: error on wakeup - Aborting this transaction
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE: Deadlock detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR: WaitOnLock: error on wakeup - Aborting this transaction
Feb 12 07:02:18 constantinople POSTMASTER: FATAL: pq_putnchar: fputc() failed: errno=32
Feb 12 07:02:19 constantinople last message repeated 2 times

Most of the time things just work, but it appears that once something
has gone awry, I experience a spiraling death.

Thoughts?  Suggestions?  Help?  :)

DwD
--
Daryl W. Dunbar
http://www.com, Where the Web Begins!
mailto:daryl@www.com
> Solaris 7 on a SPARC 20 running 6.4.2.  Occasionally (once or twice
> a day) under a very light load, brain-dead child processes begin to
> accumulate in my system.  If left unchecked, eventually the parent
> process runs out of resources and dies, orphaning all the lost
> processes.  (Now that I have solved the semaphore error, it appears
> to be the backend limit of 64 processes.)

Have you installed the following patch?  It solves the problem when
the number of backends reaches MaxBackendId.  I'm not sure if your
problem relates to this, though.

-------------------------------- cut here ---------------------------
*** postgresql-6.4.2/src/backend/postmaster/postmaster.c.orig	Sun Nov 29 10:52:32 1998
--- postgresql-6.4.2/src/backend/postmaster/postmaster.c	Sat Jan  9 18:14:52 1999
***************
*** 238,243 ****
--- 238,244 ----
  static long PostmasterRandom(void);
  static void RandomSalt(char *salt);
  static void SignalChildren(SIGNAL_ARGS);
+ static int	CountChildren(void);

  #ifdef CYR_RECODE
  void		GetCharSetByHost(char *, int, char *);
***************
*** 754,764 ****
  				 * by the backend.
  				 */

! 				if (BackendStartup(port) != STATUS_OK)
! 					PacketSendError(&port->pktInfo,
  									"Backend startup failed");
! 				else
! 					status = STATUS_ERROR;
  			}

  			/* Close the connection if required. */
--- 755,771 ----
  				 * by the backend.
  				 */

! 				if (CountChildren() < MaxBackendId) {
! 					if (BackendStartup(port) != STATUS_OK)
! 						PacketSendError(&port->pktInfo,
  									"Backend startup failed");
! 					else {
! 						status = STATUS_ERROR;
! 					}
! 				} else {
! 					PacketSendError(&port->pktInfo,
! 						"There are too many backends");
! 				}
  			}

  			/* Close the connection if required. */
***************
*** 1617,1620 ****
--- 1624,1655 ----
  	}

  	return random() ^ random_seed;
+ }
+
+ /*
+  * Count up number of children processes.
+  */
+ static int
+ CountChildren(void)
+ {
+ 	Dlelem	   *curr,
+ 			   *next;
+ 	Backend    *bp;
+ 	int			mypid = getpid();
+ 	int			cnt = 0;
+
+ 	curr = DLGetHead(BackendList);
+ 	while (curr)
+ 	{
+ 		next = DLGetSucc(curr);
+ 		bp = (Backend *) DLE_VAL(curr);
+
+ 		if (bp->pid != mypid)
+ 		{
+ 			cnt++;
+ 		}
+
+ 		curr = next;
+ 	}
+ 	return(cnt);
  }
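With this patch applied, a client that connects while all backend
slots are in use should simply see a failed connection attempt rather
than contributing to the postmaster's death.  A minimal libpq sketch
of that client-side view follows; the connection string is made up
for illustration, and only standard libpq calls (PQconnectdb,
PQstatus, PQerrorMessage, PQfinish) are assumed.

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
	/* connection parameters are made up for illustration */
	PGconn	   *conn = PQconnectdb("host=constantinople dbname=test");

	if (PQstatus(conn) == CONNECTION_BAD)
	{
		/*
		 * With the patch, a server that is out of backend slots should
		 * report "There are too many backends" in the error message.
		 */
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		exit(1);
	}

	/* ... normal queries would go here ... */

	PQfinish(conn);
	return 0;
}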
Thank you, Tatsuo-san.  This patch will solve the dying process
problem when I reach MaxBackendId (which I increased from 64 to 128).
However, I do not know what is causing the spiraling death of the
processes in the first place. :(

Is there some place I should be looking for other patches, besides
those listed on www.postgresql.org?

Thank you for your continued help.

DwD

> -----Original Message-----
> From: t-ishii@ext16.sra.co.jp [mailto:t-ishii@ext16.sra.co.jp] On
> Behalf Of Tatsuo Ishii
> Sent: Saturday, February 13, 1999 1:03 AM
> To: Daryl W. Dunbar
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] More postmaster troubles
>
> > Solaris 7 on a SPARC 20 running 6.4.2.  Occasionally (once or
> > twice a day) under a very light load, brain-dead child processes
> > begin to accumulate in my system.  If left unchecked, eventually
> > the parent process runs out of resources and dies, orphaning all
> > the lost processes.  (Now that I have solved the semaphore error,
> > it appears to be the backend limit of 64 processes.)
>
> Have you installed the following patch?  It solves the problem when
> the number of backends reaches MaxBackendId.  I'm not sure if your
> problem relates to this, though.
>
> [patch snipped]
"Daryl W. Dunbar" <daryl@www.com> writes: > Thank you Tatsousan. This patch will solve the dying process > problem when I reach MaxBackendId (which I increased from 64 to > 128). However, I do not know what is causing the spiraling death of > the processes in the first place. :( Hmm. I have noticed at least one place in the code where there is an undocumented hard-wired dependency on MaxBackendId, to wit MAX_PROC_SEMS in include/storage/proc.h which is set at 128. Presumably it should be equal to MaxBackendId (and I intend to fix that soon). Evidently that particular bug is not hurting you (yet) but perhaps there are similar errors elsewhere that kick in sooner. Do you see the spiraling-death problem if you run with MaxBackendId at its customary value of 64? The log extract you posted before mentions "fputc() failed: errno=32" which suggests an unexpected client disconnect during a transaction. I suspect the backend that gets that disconnect is failing to clean up properly before exiting, and is leaving one or more locks locked. We don't have enough info yet to track down the cause, but I suggest we could narrow it down some by seeing whether the problem goes away with a lower MaxBackendId setting. (You might also want to work on making your clients more robust, but I'd like to see if we can solve the backend bug first ...) regards, tom lane
Tom,

I have to date experienced the problem only with MaxBackendId set to
64.  Today I installed a version of the code with it set to 128 (a
number I picked arbitrarily, but I would like to get it higher).  By
the way, I had to tune the kernel to allow me to increase
MaxBackendId, this time in shared memory (SHMMAX).

As for the clients, they are web users via mod_perl/DBI/DBD::Pg.  It
is possible that a user is hitting the stop button at just the right
moment to hang the connection (backend), but I have been unable to
reproduce that so far.  That was my first thought on this problem.
The fact that it apparently spirals is disturbing; I highly doubt
there is a user out there hitting the stop button 64 times in a row.
:)

Thanks for your help,

DwD

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Saturday, February 13, 1999 3:23 PM
> To: Daryl W. Dunbar
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] More postmaster troubles
>
> "Daryl W. Dunbar" <daryl@www.com> writes:
> > Thank you, Tatsuo-san.  This patch will solve the dying process
> > problem when I reach MaxBackendId (which I increased from 64 to
> > 128).  However, I do not know what is causing the spiraling death
> > of the processes in the first place. :(
>
> Hmm.  I have noticed at least one place in the code where there is
> an undocumented hard-wired dependency on MaxBackendId, to wit
> MAX_PROC_SEMS in include/storage/proc.h, which is set at 128.
> Presumably it should be equal to MaxBackendId (and I intend to fix
> that soon).  Evidently that particular bug is not hurting you (yet),
> but perhaps there are similar errors elsewhere that kick in sooner.
> Do you see the spiraling-death problem if you run with MaxBackendId
> at its customary value of 64?
>
> The log extract you posted before mentions "fputc() failed:
> errno=32", which suggests an unexpected client disconnect during a
> transaction.  I suspect the backend that gets that disconnect is
> failing to clean up properly before exiting, and is leaving one or
> more locks locked.  We don't have enough info yet to track down the
> cause, but I suggest we could narrow it down some by seeing whether
> the problem goes away with a lower MaxBackendId setting.
>
> (You might also want to work on making your clients more robust,
> but I'd like to see if we can solve the backend bug first ...)
>
> 			regards, tom lane
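On the SHMMAX point: the symptom when the kernel limit is too low is
that the postmaster's shared memory request fails at startup.  A
minimal standalone sketch of that failure mode follows; the 4 MB
request size is made up, and the real size depends on MaxBackendId
and the buffer settings.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int
main(void)
{
	size_t	request = 4 * 1024 * 1024;	/* hypothetical segment size */
	int		shmid = shmget(IPC_PRIVATE, request, IPC_CREAT | 0600);

	if (shmid < 0)
	{
		/* EINVAL here usually means the request exceeds SHMMAX */
		fprintf(stderr, "shmget of %lu bytes failed: %s\n",
				(unsigned long) request, strerror(errno));
		return 1;
	}

	printf("got segment %d, so SHMMAX is at least %lu bytes\n",
		   shmid, (unsigned long) request);
	shmctl(shmid, IPC_RMID, NULL);		/* remove the demo segment */
	return 0;
}

On Solaris 7 the limit is normally raised with a line such as
"set shmsys:shminfo_shmmax=..." in /etc/system, followed by a reboot.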