Thread: [7.0.2] problems with spinlock under FreeBSD?

[7.0.2] problems with spinlock under FreeBSD?

From
The Hermit Hacker
Date:
We had a crash this morning of our server ... not the machine, just the
postmaster processes ... all three of them spread across three separate
ports ...

Just looking through logs now, I'm finding:

FATAL: s_lock(20048065) at spin.c:127, stuck spinlock. Aborting.

In one of them ...

I've got a load of core files all from about the same time, from the
various processes running on the different ports:

 714266  9936 -rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 ./data/base/petpostings/postgres.core
 944391  9984 -rw-------  1 pgsql  pgsql   5099520 Aug 23 10:32 ./data/base/trends_acctng/postgres.core
1015889  9984 -rw-------  1 pgsql  pgsql   5099520 Aug 23 10:33 ./data/base/pg_banners/postgres.core
1055605 10768 -rw-------  1 pgsql  pgsql   5505024 Aug 23 10:34 ./data/base/rockwell/postgres.core
 904800 10032 -rw-------  1 pgsql  pgsql   5124096 Aug 23 10:32 ./data/base/area902/postgres.core
 619085 74720 -rw-------  1 pgsql  pgsql  38219776 Jun  7 21:09 ./data/base/thtphone/postgres.core
1944360  9936 -rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 ./data/base/counter/postgres.core
 896891  9808 -rw-------  1 pgsql  pgsql   5009408 Aug 23 10:32 ./data/base/hub_traf_stats/postgres.core
1849088 20656 -rw-------  1 pgsql  pgsql  10567680 Aug 23 09:56 ./data/base/horde/postgres.core
 849311  7136 -rw-------  1 pgsql  pgsql   3645440 Aug 23 12:36 ./special/mukesh/base/archies/postgres.core
 857377  7104 -rw-------  1 pgsql  pgsql   3629056 Aug 23 12:36 ./special/mukesh/base/water/postgres.core
   8009 44176 -rw-------  1 pgsql  pgsql  22589440 Aug 23 10:39 ./data2/udmsearch/postgres.core

in case any of those are any help?  

looking in /var/log/messages, I'm seeing the following just before the
crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

would this be the cause?  if so, I have to raise some limits, no problem
there, but just want to confirm ...

thanks ...

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [7.0.2] problems with spinlock under FreeBSD?

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> looking in /var/log/messages, I'm seeing the following just before the
> crashes:

> Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> Aug 23 12:33:47 pgsql /kernel: file: table is full

That sure looks like you'd better tweak your kernel settings ... but
offhand I don't see how it could lead to "stuck spinlock" errors.
What do you get from gdb backtraces on the corefiles?
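
Something like this should do the trick (using the counter core just as
an example, and assuming the cores were dumped by the installed
/pgsql/bin/postgres executable -- adjust the paths to suit):

% gdb /pgsql/bin/postgres ./data/base/counter/postgres.core
(gdb) bt
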
        regards, tom lane


Re: [7.0.2] problems with spinlock under FreeBSD?

From
Vince Vielhaber
Date:
On Wed, 23 Aug 2000, The Hermit Hacker wrote:

> 
> We had a crash this morning of our server ... not the machine, just the
> postmaster processes ... all three of them spread across three separate
> ports ...
> 
> Just looking through logs now, I'm finding:

[snip]

> 
> looking in /var/log/messages, I'm seeing the following just before the
> crashes:
> 
> Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> Aug 23 12:33:47 pgsql /kernel: file: table is full
> 
> would this be the cause?  if so, I have to raise some limits, no problem
> there, but just want to confirm ...

What's maxusers set to in the kernel?  (kern.maxfiles is sized off it by
default.)  If you want to try raising the file limit on the fly, try  

# sysctl -w kern.maxfiles=abiggernumberthanitisnow

and set it to a bigger number than it is now.  

# sysctl kern.maxfiles

will tell you what it's currently set to.  Whether or not that has
anything to do with the spinlock problem, no idea.
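
If you want the new value to stick across reboots, a line in
/etc/sysctl.conf should do it (assuming your rc is new enough to read
it at boot):

# echo 'kern.maxfiles=8192' >> /etc/sysctl.conf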

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
  128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
     Online Campground Directory     http://www.camping-usa.com
     Online Giftshop Superstore      http://www.cloudninegifts.com
==========================================================================





Re: [7.0.2] problems with spinlock under FreeBSD?

From
The Hermit Hacker
Date:
On Thu, 24 Aug 2000, Vince Vielhaber wrote:

> On Wed, 23 Aug 2000, The Hermit Hacker wrote:
> 
> > 
> > We had a crash this morning of our server ... not the machine, just the
> > postmaster processes ... all three of them spread across three separate
> > ports ...
> > 
> > Just looking through logs now, I'm finding:
> 
> [snip]
> 
> > 
> > looking in /var/log/messages, I'm seeing the following just before the
> > crashes:
> > 
> > Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> > Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> > Aug 23 12:33:47 pgsql /kernel: file: table is full
> > 
> > would this be the cause?  if so, I have to raise some limits, no problem
> > there, but just want to confirm ...
> 
> What's maxusers set to in the kernel?  If you want to try raising it on

maxusers        128


> the fly, try  
> 
> # sysctl -w kern.maxfiles=abiggernumberthanitisnow
> 
> and set it to a bigger number than it is now.  
> 
> # sysctl kern.maxfiles

just up'd her to 8192 from 4136 ... thanks, forgot about the sysctl value
for this, was dreading having to recompile :(




Re: [7.0.2] problems with spinlock under FreeBSD?

From
The Hermit Hacker
Date:
Attached ...

On Thu, 24 Aug 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > looking in /var/log/messages, I'm seeing the following just before the
> > crashes:
> 
> > Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
> > Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
> > Aug 23 12:33:47 pgsql /kernel: file: table is full
> 
> That sure looks like you'd better tweak your kernel settings ... but
> offhand I don't see how it could lead to "stuck spinlock" errors.
> What do you get from gdb backtraces on the corefiles?
> 
>             regards, tom lane
> 

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

Re: [7.0.2] problems with spinlock under FreeBSD?

From
Vince Vielhaber
Date:
On Thu, 24 Aug 2000, The Hermit Hacker wrote:

> > the fly, try  
> > 
> > # sysctl -w kern.maxfiles=abiggernumberthanitisnow
> > 
> > and set it to a bigger number than it is now.  
> > 
> > # sysctl kern.maxfiles
> 
> just up'd her to 8192 from 4136 ... thanks, forgot about the sysctl value
> for this, was dreading having to recompile :(

I had to do it a couple of months ago on one of my machines.  It was 
almost at a year of uptime and I didn't want to reboot if I didn't
have to!

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
  128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
     Online Campground Directory     http://www.camping-usa.com
     Online Giftshop Superstore      http://www.cloudninegifts.com
==========================================================================





Re: [7.0.2] problems with spinlock under FreeBSD?

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
>> What do you get from gdb backtraces on the corefiles?

> #2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
> #3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
> #4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127
> #5  0x80f3903 in LockRelease (lockmethod=1, locktag=0xbfbfe674, lockmode=1) at lock.c:1044
> #6  0x80f2af9 in UnlockRelation (relation=0x82063f0, lockmode=1) at lmgr.c:178
> #7  0x806f25e in index_endscan (scan=0x8208780) at indexam.c:284

That's interesting ... someone failing to release lock.c's master
spinlock, it looks like.  Do you have anything in the postmaster log
from just before the crashes?
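
(If you still have a core handy, it might also be worth confirming from
gdb that the lock byte really is still set, along the lines of

(gdb) frame 3
(gdb) print *lock

assuming your gdb will let you poke at the s_lock frame's arguments.)
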
        regards, tom lane


Re: [7.0.2] problems with spinlock under FreeBSD?

From
The Hermit Hacker
Date:
On Thu, 24 Aug 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> >> What do you get from gdb backtraces on the corefiles?
> 
> > #2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
> > #3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
> > #4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127
> > #5  0x80f3903 in LockRelease (lockmethod=1, locktag=0xbfbfe674, lockmode=1) at lock.c:1044
> > #6  0x80f2af9 in UnlockRelation (relation=0x82063f0, lockmode=1) at lmgr.c:178
> > #7  0x806f25e in index_endscan (scan=0x8208780) at indexam.c:284
> 
> That's interesting ... someone failing to release lock.c's master
> spinlock, it looks like.  Do you have anything in the postmaster log
> from just before the crashes?

okay, nothing that I can see that is 'unusual' in the log files, but as
shown below, at ~10:30am today, the same thing appears to have happened
...

%ls -lt */*.core
-rw-------  1 pgsql  pgsql  22589440 Aug 23 10:39 udmsearch/postgres.core
-rw-------  1 pgsql  pgsql   5505024 Aug 23 10:34 rockwell/postgres.core
-rw-------  1 pgsql  pgsql   5099520 Aug 23 10:33 pg_banners/postgres.core
-rw-------  1 pgsql  pgsql   5009408 Aug 23 10:32 hub_traf_stats/postgres.core
-rw-------  1 pgsql  pgsql   5099520 Aug 23 10:32 trends_acctng/postgres.core
-rw-------  1 pgsql  pgsql   5124096 Aug 23 10:32 area902/postgres.core
-rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 petpostings/postgres.core
-rw-------  1 pgsql  pgsql   5074944 Aug 23 10:32 counter/postgres.core
-rw-------  1 pgsql  pgsql  10567680 Aug 23 09:56 horde/postgres.core

Check the gdb on a couple of them:

(gdb) where
#0  0x18271d90 in kill () from /usr/lib/libc.so.4
#1  0x182b2e09 in abort () from /usr/lib/libc.so.4
#2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127

(gdb) where
#0  0x18271d90 in kill () from /usr/lib/libc.so.4
#1  0x182b2e09 in abort () from /usr/lib/libc.so.4
#2  0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3  0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4  0x80f1580 in SpinAcquire (lockid=7) at spin.c:127

they all appear to be in the same place ...

now, I'm running 4 separate postmaster daemons, with separate data
directories, as:

ps ux | grep postmaster | grep 543
pgsql 50554  0.0  0.1  6904  556  p0- I     1:12PM   0:04.88 /pgsql/bin/postmaster -D/pgsql/special/sales.org -i -p 5434 (postgres)
pgsql 61821  0.0  0.1  7080  636  p6- S     4:38PM   3:03.86 /pgsql/bin/postmaster -B 256 -N 128 -o -F -o /pgsql/logs/5432.61820 -S 32768 -i -p 5432 -D/pgsql/data (postgres)
pgsql 62268  0.0  0.0  5488    0  p4- IW   -         0:00.00 /pgsql/bin/postmaster -d 1 -N 16 -o -F -o /pgsql/logs/5433.62267 -S 32768 -i -p 5433 -D/pgsql/special/lee (postgres)
pgsql 27084  0.0  0.1  5496  596  p4- S     8:25AM   0:54.11 /pgsql/bin/postmaster -d 1 -N 16 -o -F -o /pgsql/logs/5437.27083 -S 32768 -i -p 5437 -D/pgsql/special/mukesh (postgres)

and the above core files are from the one running on 5432 ...

you still have your account on that machine if you want to take a quick
look around ... else, anything else I should be looking at?



Re: [7.0.2] problems with spinlock under FreeBSD?

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> you still have your account on that machine if you want to take a quick
> look around ... else, anything else I should be looking at?

I poked around and couldn't learn much of anything --- the logfiles from
yesterday are already gone, apparently.  I did find some interesting
entries in today's logfiles:

%grep Lru *
5432.61820:FATAL 1:  ReleaseLruFile: No open files available to be closed
postmaster.5437.36290:FATAL 1:  ReleaseLruFile: No open files available to be closed
postmaster.5437.62218:FATAL 1:  ReleaseLruFile: No open files available to be closed

What we see here are backends choking because there are no free kernel
file descriptor slots, even after they've closed *all* of their own
discretionary FDs.  So you've definitely got a serious problem with
insufficient FD slots.  Time to tweak those kernel parameters.
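
Some back-of-the-envelope arithmetic, just to show how easy it is to
get there (the per-backend figure is only an assumption, but it's in
the right ballpark): if each backend holds ~32 kernel FDs, your -N 128
postmaster alone can account for 128 * 32 = 4096 descriptors, which all
but exhausts the old kern.maxfiles of 4136 before the other three
postmasters and the rest of the system open a single file.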

I still don't see a linkage between too few FDs and the stuck-spinlock
crashes, but maybe there is one.
        regards, tom lane


Re: [7.0.2] problems with spinlock under FreeBSD?

From
The Hermit Hacker
Date:
okay, I just doubled my FDs to 8192 from 4136 and will watch things
... anyone know of a way of telling how many are currently in use, and
where they peaked?  something similar to 'netstat -m' showing mbufs?

thanks tom ...

On Thu, 24 Aug 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > you still have your account on that machine if you want to take a quick
> > look around ... else, anything else I should be looking at?
> 
> I poked around and couldn't learn much of anything --- the logfiles from
> yesterday are already gone, apparently.  I did find some interesting
> entries in today's logfiles:
> 
> %grep Lru *
> 5432.61820:FATAL 1:  ReleaseLruFile: No open files available to be closed
> postmaster.5437.36290:FATAL 1:  ReleaseLruFile: No open files available to be closed
> postmaster.5437.62218:FATAL 1:  ReleaseLruFile: No open files available to be closed
> 
> What we see here are backends choking because there are no free kernel
> file descriptor slots, even after they've closed *all* of their own
> discretionary FDs.  So you've definitely got a serious problem with
> insufficient FD slots.  Time to tweak those kernel parameters.
> 
> I still don't see a linkage between too few FDs and the stuck-spinlock
> crashes, but maybe there is one.
> 
>             regards, tom lane
> 

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [7.0.2] problems with spinlock under FreeBSD?

From
Vince Vielhaber
Date:
On Thu, 24 Aug 2000, The Hermit Hacker wrote:

> 
> okay, I just doubled my FDs to 8192 from 4136 and will watch things
> ... anyone know of a way of telling how many are currently in use, and
> where they peaked?  something similar to 'netstat -m' showing mbufs?

Possibly pstat will do it.  It'll give current but I don't know about
peak.
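
Specifically,

# pstat -T

should show a "files" line with current usage against the limit (the
exact output format varies a bit between FreeBSD versions).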

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
  128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
     Online Campground Directory     http://www.camping-usa.com
     Online Giftshop Superstore      http://www.cloudninegifts.com
==========================================================================





Re: [7.0.2] problems with spinlock under FreeBSD?

From
Oleg Bartunov
Date:
On Thu, 24 Aug 2000, The Hermit Hacker wrote:

> Date: Thu, 24 Aug 2000 16:26:07 -0300 (ADT)
> From: The Hermit Hacker <scrappy@hub.org>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [7.0.2] problems with spinlock under FreeBSD? 
> 
> 
> okay, I just doubled my FDs to 8192 from 4136 and will watch things
> ... anyone know of a way of telling how many are currently in use, and
> where they peaked?  something similar to 'netstat -m' showing mbufs?


I have a rather busy site running FreeBSD:
12:57:38[info]:/home/megera$ sysctl kern.maxfiles
kern.maxfiles: 16424

Did you try 'systat -vm'?
Regards,
    Oleg

> 
> thanks tom ...
> 
> On Thu, 24 Aug 2000, Tom Lane wrote:
> 
> > The Hermit Hacker <scrappy@hub.org> writes:
> > > you still have your account on that machine if you want to take a quick
> > > look around ... else, anything else I should be looking at?
> > 
> > I poked around and couldn't learn much of anything --- the logfiles from
> > yesterday are already gone, apparently.  I did find some interesting
> > entries in today's logfiles:
> > 
> > %grep Lru *
> > 5432.61820:FATAL 1:  ReleaseLruFile: No open files available to be closed
> > postmaster.5437.36290:FATAL 1:  ReleaseLruFile: No open files available to be closed
> > postmaster.5437.62218:FATAL 1:  ReleaseLruFile: No open files available to be closed
> > 
> > What we see here are backends choking because there are no free kernel
> > file descriptor slots, even after they've closed *all* of their own
> > discretionary FDs.  So you've definitely got a serious problem with
> > insufficient FD slots.  Time to tweak those kernel parameters.
> > 
> > I still don't see a linkage between too few FDs and the stuck-spinlock
> > crashes, but maybe there is one.
> > 
> >             regards, tom lane
> > 
> 
> Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
> Systems Administrator @ hub.org 
> primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83