Thread: [7.0.2] problems with spinlock under FreeBSD?
We had a crash this morning of our server ... not the machine, just the
postmaster processes ... all three of them spread across three separate
ports ...

Just looking through the logs now, I'm finding:

  FATAL: s_lock(20048065) at spin.c:127, stuck spinlock. Aborting.

in one of them ... I've got a load of core files all from about the same
time, from the various processes running on the different ports:

 714266  9936 -rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 ./data/base/petpostings/postgres.core
 944391  9984 -rw-------  1 pgsql  pgsql   5099520 Aug 23 10:32 ./data/base/trends_acctng/postgres.core
1015889  9984 -rw-------  1 pgsql  pgsql   5099520 Aug 23 10:33 ./data/base/pg_banners/postgres.core
1055605 10768 -rw-------  1 pgsql  pgsql   5505024 Aug 23 10:34 ./data/base/rockwell/postgres.core
 904800 10032 -rw-------  1 pgsql  pgsql   5124096 Aug 23 10:32 ./data/base/area902/postgres.core
 619085 74720 -rw-------  1 pgsql  pgsql  38219776 Jun  7 21:09 ./data/base/thtphone/postgres.core
1944360  9936 -rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 ./data/base/counter/postgres.core
 896891  9808 -rw-------  1 pgsql  pgsql   5009408 Aug 23 10:32 ./data/base/hub_traf_stats/postgres.core
1849088 20656 -rw-------  1 pgsql  pgsql  10567680 Aug 23 09:56 ./data/base/horde/postgres.core
 849311  7136 -rw-------  1 pgsql  pgsql   3645440 Aug 23 12:36 ./special/mukesh/base/archies/postgres.core
 857377  7104 -rw-------  1 pgsql  pgsql   3629056 Aug 23 12:36 ./special/mukesh/base/water/postgres.core
   8009 44176 -rw-------  1 pgsql  pgsql  22589440 Aug 23 10:39 ./data2/udmsearch/postgres.core

if any of those help any? Looking in /var/log/messages, I'm seeing the
following just before the crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

Would this be the cause? If so, I have to raise some limits, no problem
there, but just want to confirm ...
thanks ...

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
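[Editorial aside: the "Too many open files in system" / "file: table is full" messages above are the kernel-wide open-file table filling up (errno ENFILE), which is distinct from a single process exhausting its own descriptor limit (errno EMFILE). The sketch below is not from the thread; it shrinks the current process's own limit with `setrlimit` so the failure mode is easy to reproduce without harming the machine.]

```python
import errno
import os
import resource

def open_until_failure(path="/dev/null"):
    """Open `path` repeatedly until open() fails; return (count, errno)."""
    fds = []
    try:
        while True:
            fds.append(os.open(path, os.O_RDONLY))
    except OSError as e:
        return len(fds), e.errno
    finally:
        # Clean up so the demo doesn't leak descriptors.
        for fd in fds:
            os.close(fd)

# Cap this process at 64 descriptors for the demo; the system-wide
# table (kern.maxfiles on FreeBSD) is left alone.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

n, err = open_until_failure()
if err == errno.ENFILE:
    print("kernel file table full (ENFILE) after", n, "opens")
elif err == errno.EMFILE:
    print("per-process limit reached (EMFILE) after", n, "opens")
```

In Marc's case the per-process limits were fine; it was the global `kern.maxfiles` table that was exhausted, so every process on the box started failing at once.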
The Hermit Hacker <scrappy@hub.org> writes:
> looking in /var/log/messages, I'm seeing the following just before the
> crashes:

> Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> Aug 23 12:33:47 pgsql /kernel: file: table is full

That sure looks like you'd better tweak your kernel settings ... but
offhand I don't see how it could lead to "stuck spinlock" errors.
What do you get from gdb backtraces on the corefiles?

			regards, tom lane
On Wed, 23 Aug 2000, The Hermit Hacker wrote:

> We had a crash this morning of our server ... not the machine, just the
> postmaster processes ... all three of them spread across three separate
> ports ...
>
> Just looking through logs now, I'm finding:

[snip]

> looking in /var/log/messages, I'm seeing the following just before the
> crashes:
>
> Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> Aug 23 12:33:47 pgsql /kernel: file: table is full
>
> would this be the cause? if so, I have to raise some limits, no problem
> there, but just want to confirm ...

What's maxusers set to in the kernel?  If you want to try raising it on
the fly, try

  # sysctl -w kern.maxfiles=abiggernumberthanitisnow

and set it to a bigger number than it is now.

  # sysctl kern.maxfiles

will tell you what it's currently set to.  Whether or not that has
anything to do with the spinlock problem, no idea.

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
 128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
        Online Campground Directory     http://www.camping-usa.com
       Online Giftshop Superstore     http://www.cloudninegifts.com
==========================================================================
On Thu, 24 Aug 2000, Vince Vielhaber wrote:

> On Wed, 23 Aug 2000, The Hermit Hacker wrote:
>
> > We had a crash this morning of our server ... not the machine, just the
> > postmaster processes ... all three of them spread across three separate
> > ports ...
> >
> > Just looking through logs now, I'm finding:
>
> [snip]
>
> > looking in /var/log/messages, I'm seeing the following just before the
> > crashes:
> >
> > Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> > Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> > Aug 23 12:33:47 pgsql /kernel: file: table is full
> >
> > would this be the cause? if so, I have to raise some limits, no problem
> > there, but just want to confirm ...
>
> What's maxusers set to in the kernel?

maxusers 128

> If you want to try raising it on the fly, try
>
>   # sysctl -w kern.maxfiles=abiggernumberthanitisnow
>
> and set it to a bigger number than it is now.
>
>   # sysctl kern.maxfiles

just up'd her to 8192 from 4136 ... thanks, forgot about the sysctl value
for this, was dreading having to recompile :(
Attached ...

On Thu, 24 Aug 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > looking in /var/log/messages, I'm seeing the following just before the
> > crashes:
>
> > Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> > Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> > Aug 23 12:33:47 pgsql /kernel: file: table is full
>
> That sure looks like you'd better tweak your kernel settings ... but
> offhand I don't see how it could lead to "stuck spinlock" errors.
> What do you get from gdb backtraces on the corefiles?
>
> 			regards, tom lane
On Thu, 24 Aug 2000, The Hermit Hacker wrote:

> > the fly, try
> >
> >   # sysctl -w kern.maxfiles=abiggernumberthanitisnow
> >
> > and set it to a bigger number than it is now.
> >
> >   # sysctl kern.maxfiles
>
> just up'd her to 8192 from 4136 ... thanks, forgot about the sysctl value
> for this, was dreading having to recompile :(

I had to do it a couple of months ago on one of my machines.  It was
almost at a year of uptime and I didn't want to reboot if I didn't
have to!

Vince.
The Hermit Hacker <scrappy@hub.org> writes:
>> What do you get from gdb backtraces on the corefiles?

> #2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
> #3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
> #4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127
> #5  0x80f3903 in LockRelease (lockmethod=1, locktag=0xbfbfe674, lockmode=1) at lock.c:1044
> #6  0x80f2af9 in UnlockRelation (relation=0x82063f0, lockmode=1) at lmgr.c:178
> #7  0x806f25e in index_endscan (scan=0x8208780) at indexam.c:284

That's interesting ... someone failing to release lock.c's master
spinlock, it looks like.  Do you have anything in the postmaster log
from just before the crashes?

			regards, tom lane
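[Editorial aside: the `s_lock()`/`s_lock_stuck()` frames in the backtrace implement a simple policy: spin on a test-and-set lock, and abort with the "stuck spinlock" FATAL once a deadline passes, on the theory that a spinlock is only ever held for a few instructions, so waiting minutes means the holder died or deadlocked while holding it. A rough sketch of that policy, with hypothetical names rather than the actual PostgreSQL source:]

```python
import threading
import time

class StuckSpinlock(Exception):
    pass

def s_lock_sketch(lock, timeout=2.0, delay=0.001):
    """Acquire `lock`, raising StuckSpinlock after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not lock.acquire(blocking=False):   # the TAS() attempt
        if time.monotonic() > deadline:
            # Stands in for s_lock_stuck(), which prints
            # "FATAL: s_lock(...) ... stuck spinlock. Aborting." and abort()s.
            raise StuckSpinlock("stuck spinlock")
        time.sleep(delay)                     # back off, then retry

free = threading.Lock()
s_lock_sketch(free)            # uncontended: acquired immediately
free.release()

held = threading.Lock()
held.acquire()                 # simulate a holder that never releases
try:
    s_lock_sketch(held, timeout=0.1)
except StuckSpinlock as e:
    print(e)
```

This is why Tom reads the trace as "someone failing to release lock.c's master spinlock": the aborting backend is the victim that timed out, not necessarily the process at fault.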
On Thu, 24 Aug 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> >> What do you get from gdb backtraces on the corefiles?
>
> > #2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
> > #3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
> > #4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127
> > #5  0x80f3903 in LockRelease (lockmethod=1, locktag=0xbfbfe674, lockmode=1) at lock.c:1044
> > #6  0x80f2af9 in UnlockRelation (relation=0x82063f0, lockmode=1) at lmgr.c:178
> > #7  0x806f25e in index_endscan (scan=0x8208780) at indexam.c:284
>
> That's interesting ... someone failing to release lock.c's master
> spinlock, it looks like.  Do you have anything in the postmaster log
> from just before the crashes?

okay, nothing that I can see that is 'unusual' in the log files, but as
shown below, at ~10:30am today, the same thing appears to have happened ...

%ls -lt */*.core
-rw-------  1 pgsql  pgsql  22589440 Aug 23 10:39 udmsearch/postgres.core
-rw-------  1 pgsql  pgsql   5505024 Aug 23 10:34 rockwell/postgres.core
-rw-------  1 pgsql  pgsql   5099520 Aug 23 10:33 pg_banners/postgres.core
-rw-------  1 pgsql  pgsql   5009408 Aug 23 10:32 hub_traf_stats/postgres.core
-rw-------  1 pgsql  pgsql   5099520 Aug 23 10:32 trends_acctng/postgres.core
-rw-------  1 pgsql  pgsql   5124096 Aug 23 10:32 area902/postgres.core
-rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 petpostings/postgres.core
-rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 counter/postgres.core
-rw-------  1 pgsql  pgsql  10567680 Aug 23 09:56 horde/postgres.core

Check the gdb on a couple of them:

(gdb) where
#0  0x18271d90 in kill () from /usr/lib/libc.so.4
#1  0x182b2e09 in abort () from /usr/lib/libc.so.4
#2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127

(gdb) where
#0  0x18271d90 in kill () from /usr/lib/libc.so.4
#1  0x182b2e09 in abort () from /usr/lib/libc.so.4
#2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127

they all appear to be in the same place ... now, I'm running 4 separate
postmaster daemons, with separate data directories, as:

ps ux | grep postmaster | grep 543
pgsql 50554  0.0  0.1  6904  556  p0- I   1:12PM  0:04.88 /pgsql/bin/postmaster -D/pgsql/special/sales.org -i -p 5434 (postgres)
pgsql 61821  0.0  0.1  7080  636  p6- S   4:38PM  3:03.86 /pgsql/bin/postmaster -B 256 -N 128 -o -F -o /pgsql/logs/5432.61820 -S 32768 -i -p 5432 -D/pgsql/data (postgres)
pgsql 62268  0.0  0.0  5488    0  p4- IW  -       0:00.00 /pgsql/bin/postmaster -d 1 -N 16 -o -F -o /pgsql/logs/5433.62267 -S 32768 -i -p 5433 -D/pgsql/special/lee (postgres)
pgsql 27084  0.0  0.1  5496  596  p4- S   8:25AM  0:54.11 /pgsql/bin/postmaster -d 1 -N 16 -o -F -o /pgsql/logs/5437.27083 -S 32768 -i -p 5437 -D/pgsql/special/mukesh (postgres)

and the above core files are from the one running on 5432 ... you still
have your account on that machine if you want to take a quick look around
... else, anything else I should be looking at?
The Hermit Hacker <scrappy@hub.org> writes:
> you still have your account on that machine if you want to take a quick
> look around ... else, anything else I should be looking at?

I poked around and couldn't learn much of anything --- the logfiles from
yesterday are already gone, apparently.  I did find some interesting
entries in today's logfiles:

%grep Lru *
5432.61820:FATAL 1:  ReleaseLruFile: No open files available to be closed
postmaster.5437.36290:FATAL 1:  ReleaseLruFile: No open files available to be closed
postmaster.5437.62218:FATAL 1:  ReleaseLruFile: No open files available to be closed

What we see here are backends choking because there are no free kernel
file descriptor slots, even after they've closed *all* of their own
discretionary FDs.  So you've definitely got a serious problem with
insufficient FD slots.  Time to tweak those kernel parameters.

I still don't see a linkage between too few FDs and the stuck-spinlock
crashes, but maybe there is one.

			regards, tom lane
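[Editorial aside: the mechanism Tom describes, backends keeping their open files in LRU order and closing the least recently used one whenever the kernel refuses a new open, can be sketched as below. This is a hypothetical illustration modeled loosely on that behaviour, not the actual fd.c code; the `ReleaseLruFile` FATAL above corresponds to reaching the retry path with nothing left to close.]

```python
import collections
import errno
import os

class FdPool:
    """Toy pool of open files, closed in LRU order under FD pressure."""

    def __init__(self):
        self.lru = collections.OrderedDict()  # path -> fd, oldest first

    def _release_lru_file(self):
        if not self.lru:
            # The condition behind the log message in the thread.
            raise RuntimeError(
                "ReleaseLruFile: No open files available to be closed")
        path, fd = self.lru.popitem(last=False)
        os.close(fd)

    def open(self, path, flags=os.O_RDONLY):
        while True:
            try:
                fd = os.open(path, flags)
            except OSError as e:
                if e.errno in (errno.EMFILE, errno.ENFILE):
                    self._release_lru_file()   # free a slot, then retry
                    continue
                raise
            self.lru[path] = fd                # newest entries at the end
            self.lru.move_to_end(path)
            return fd

pool = FdPool()
fd = pool.open("/dev/null")
print("opened fd", fd)
```

Note that each backend manages only its own pool, which is why Tom's diagnosis is so stark: once the kernel-wide table itself is full, closing every discretionary FD a single backend owns still may not free a slot.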
okay, I just doubled my FDs to 8192 from 4136 and will watch things
... anyone know of a way of telling how many are currently in use, and
where they peaked? something similar to 'netstat -m' showing mbufs?

thanks tom ...

On Thu, 24 Aug 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > you still have your account on that machine if you want to take a quick
> > look around ... else, anything else I should be looking at?
>
> I poked around and couldn't learn much of anything --- the logfiles from
> yesterday are already gone, apparently.  I did find some interesting
> entries in today's logfiles:
>
> %grep Lru *
> 5432.61820:FATAL 1:  ReleaseLruFile: No open files available to be closed
> postmaster.5437.36290:FATAL 1:  ReleaseLruFile: No open files available to be closed
> postmaster.5437.62218:FATAL 1:  ReleaseLruFile: No open files available to be closed
>
> What we see here are backends choking because there are no free kernel
> file descriptor slots, even after they've closed *all* of their own
> discretionary FDs.  So you've definitely got a serious problem with
> insufficient FD slots.  Time to tweak those kernel parameters.
>
> I still don't see a linkage between too few FDs and the stuck-spinlock
> crashes, but maybe there is one.
>
> 			regards, tom lane
On Thu, 24 Aug 2000, The Hermit Hacker wrote:

> okay, I just doubled my FDs to 8192 from 4136 and will watch things
> ... anyone know of a way of telling how many are currently in use, and
> where they peaked? something similar to 'netstat -m' showing mbufs?

Possibly pstat will do it.  It'll give current but I don't know about
peak.

Vince.
On Thu, 24 Aug 2000, The Hermit Hacker wrote:

> Date: Thu, 24 Aug 2000 16:26:07 -0300 (ADT)
> From: The Hermit Hacker <scrappy@hub.org>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [7.0.2] problems with spinlock under FreeBSD?
>
> okay, I just doubled my FDs to 8192 from 4136 and will watch things
> ... anyone know of a way of telling how many are currently in use, and
> where they peaked? something similar to 'netstat -m' showing mbufs?

I have one rather busy site running FreeBSD:

12:57:38[info]:/home/megera$ sysctl kern.maxfiles
kern.maxfiles: 16424

Did you try systat -vm ?

	Regards,
		Oleg

> thanks tom ...
>
> On Thu, 24 Aug 2000, Tom Lane wrote:
>
> > The Hermit Hacker <scrappy@hub.org> writes:
> > > you still have your account on that machine if you want to take a quick
> > > look around ... else, anything else I should be looking at?
> >
> > I poked around and couldn't learn much of anything --- the logfiles from
> > yesterday are already gone, apparently.  I did find some interesting
> > entries in today's logfiles:
> >
> > %grep Lru *
> > 5432.61820:FATAL 1:  ReleaseLruFile: No open files available to be closed
> > postmaster.5437.36290:FATAL 1:  ReleaseLruFile: No open files available to be closed
> > postmaster.5437.62218:FATAL 1:  ReleaseLruFile: No open files available to be closed
> >
> > What we see here are backends choking because there are no free kernel
> > file descriptor slots, even after they've closed *all* of their own
> > discretionary FDs.  So you've definitely got a serious problem with
> > insufficient FD slots.  Time to tweak those kernel parameters.
> >
> > I still don't see a linkage between too few FDs and the stuck-spinlock
> > crashes, but maybe there is one.
> >
> > 			regards, tom lane

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83