Thread: PostgreSQL crash on FreeBSD 7
Hello.

I have problems with Postgres core dumping on FreeBSD 7 (RELENG_7).

Here is backtrace from gdb postgres postgres.core:

(gdb) bt
#0 0x485dc277 in kill () from /lib/libc.so.7
#1 0x485dc1d6 in raise () from /lib/libc.so.7
#2 0x485dadda in abort () from /lib/libc.so.7
#3 0x0824c075 in errfinish ()
#4 0x0824c8b1 in elog_finish ()
#5 0x081c9184 in s_lock ()
#6 0x081c8d48 in LWLockAcquire ()
#7 0x081c61ec in LockAcquire ()
#8 0x081c4289 in LockRelationOid ()
#9 0x080938fc in relation_open ()
#10 0x08096d5a in index_open ()
#11 0x08096139 in systable_beginscan ()
#12 0x08134f10 in RelationBuildTriggers ()
#13 0x08245d4d in RelationCacheInitializePhase2 ()
#14 0x08256af0 in InitPostgres ()
#15 0x081cfd13 in PostgresMain ()
#16 0x081a90ec in ClosePostmasterPorts ()
#17 0x081a9ea7 in PostmasterMain ()
#18 0x0816912f in main ()

Extract from dmesg:
pid 30622 (postgres), uid 70: exited on signal 6 (core dumped)

Nothing interesting in other logs.

I run FreeBSD 7.0-BETA1 on Dual-Core AMD Opteron(tm) Processor 2216
(2394.01-MHz 686-class CPU) with ULE scheduler
PostgreSQL 8.2.5

I can't find what triggers this behavior (it started core dumping
after upgrading from FreeBSD 6.2).

Does anyone have a solution for this problem?

Michael
Michael wrote:
> Extract from dmesg:
> pid 30622 (postgres), uid 70: exited on signal 6 (core dumped)
>
> Nothing interesting in other logs.
>
> I run FreeBSD 7.0-BETA1 on Dual-Core AMD Opteron(tm) Processor 2216
> (2394.01-MHz 686-class CPU) with ULE scheduler
> PostgreSQL 8.2.5
>
> I can't find what triggers this behavior (it started core dumping
> after upgrading from FreeBSD 6.2)

This probably means that the spinlock support is not up to speed for your platform. It is strange though -- I think I've seen other people using FreeBSD 7. I don't see any on the buildfarm:
http://buildfarm.postgresql.org/cgi-bin/show_status.pl

It probably means you'll need to do some hacking to make it work again.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Michael <michael@gameservice.ru> writes:
> Here is backtrace from gdb postgres postgres.core:
> (gdb) bt
> #0 0x485dc277 in kill () from /lib/libc.so.7
> #1 0x485dc1d6 in raise () from /lib/libc.so.7
> #2 0x485dadda in abort () from /lib/libc.so.7
> #3 0x0824c075 in errfinish ()
> #4 0x0824c8b1 in elog_finish ()
> #5 0x081c9184 in s_lock ()
> #6 0x081c8d48 in LWLockAcquire ()
> #7 0x081c61ec in LockAcquire ()

Apparently s_lock_stuck ... though you might want to look at postmaster's stderr output to confirm that.

> I run FreeBSD 7.0-BETA1 on Dual-Core AMD Opteron(tm) Processor 2216
> (2394.01-MHz 686-class CPU) with ULE scheduler
> PostgreSQL 8.2.5
> I can't find what triggers this behavior (it started core dumping
> after upgrading from FreeBSD 6.2)

Did you recompile Postgres? Maybe you need to. I dunno what the differences are between 6.2 and 7 ...

regards, tom lane
TL> Apparently s_lock_stuck ... though you might want to look at
TL> postmaster's stderr output to confirm that.

Yes, you are right:

2007-10-25 23:37:12 MSD (u=picred,db=picred)PANIC: stuck spinlock (0x4880c3b0) detected at lwlock.c:379

TL> Did you recompile Postgres? Maybe you need to. I dunno what the
TL> differences are between 6.2 and 7 ...

Yes.

Michael
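For readers following the thread: the PANIC above is what a spinlock waiter reports when it gives up. The general pattern can be sketched as below. This is a minimal illustration, not PostgreSQL's s_lock.c: the names demo_spinlock, demo_spin_acquire, demo_spin_release and demo_spin_acquire_or_abort are hypothetical, and the retry policy is reduced to a plain attempt counter with no sleeping between attempts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not PostgreSQL's own code. */
typedef struct {
    atomic_flag held;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/*
 * Try to take the lock at most max_attempts times.
 * Returns true on success, false if the lock never became free.
 */
bool demo_spin_acquire(demo_spinlock *lock, long max_attempts)
{
    assert(lock != NULL);
    for (long i = 0; i < max_attempts; i++) {
        /* test-and-set returns the previous value: false means we now own it */
        if (!atomic_flag_test_and_set_explicit(&lock->held,
                                               memory_order_acquire))
            return true;
    }
    return false;
}

void demo_spin_release(demo_spinlock *lock)
{
    assert(lock != NULL);
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/*
 * Acquire or die. A waiter that exhausts its attempts cannot tell a
 * crashed holder from a merely slow one, so it reports and aborts.
 */
void demo_spin_acquire_or_abort(demo_spinlock *lock, long max_attempts,
                                const char *file, int line)
{
    if (!demo_spin_acquire(lock, max_attempts)) {
        fprintf(stderr, "PANIC: stuck spinlock (%p) detected at %s:%d\n",
                (void *) lock, file, line);
        abort();    /* raises SIGABRT, reported by dmesg as signal 6 */
    }
}
```

The point of the sketch is that the process which aborts is the waiter, not the holder. The core dump therefore shows where a backend was waiting, and says nothing directly about which process held the lock or why it did not release it.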
Michael <michael@gameservice.ru> writes:
> 2007-10-25 23:37:12 MSD (u=picred,db=picred)PANIC: stuck spinlock (0x4880c3b0) detected at lwlock.c:379

You said this was an Opteron? Why is it printing only 32-bit addresses?

> TL> Did you recompile Postgres? Maybe you need to. I dunno what the
> TL> differences are between 6.2 and 7 ...
> Yes.

I'm thinking the rebuild broke somehow ... on the strength of the above, maybe it's partially 32 and partially 64 bits. This could have been pilot error on your part, or maybe FBSD7 wants some new/different compile or link switches that our configuration code doesn't know about.

Did you rebuild in a pre-existing PG build tree? If so, that might have resulted in a partial rebuild that could create such a problem. I'd suggest "make distclean", reconfigure, rebuild before you waste any further human effort on the problem...

A slightly different thought is that you're likely using a beta gcc release that has maybe got bugs. If decreasing the -O level helps, I'd suspect that.

regards, tom lane
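The "partially 32 and partially 64 bits" theory can be checked mechanically, because FreeBSD executables and shared libraries are ELF files and the fifth byte of an ELF header records the word size. A minimal sketch follows; the function name elf_word_size is hypothetical, and the caller is assumed to have read the first bytes of each file (the postgres binary and the libraries it loads) into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Inspect the start of a file image and report its ELF word size.
 * Returns 32 or 64, or 0 if the buffer is too short, is not ELF,
 * or carries an unknown class value.
 */
int elf_word_size(const unsigned char *hdr, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    assert(hdr != NULL);
    if (len < 5 || memcmp(hdr, magic, 4) != 0)
        return 0;

    switch (hdr[4]) {       /* e_ident[EI_CLASS] */
    case 1:
        return 32;          /* ELFCLASS32 */
    case 2:
        return 64;          /* ELFCLASS64 */
    default:
        return 0;
    }
}
```

If every file involved reports the same size, a mixed build is unlikely to be the cause. The file(1) utility prints the same information without any code.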
TL> You said this was an Opteron? Why is it printing only 32-bit addresses?

Yes, I'm using it in 32-bit mode.

TL> Did you rebuild in a pre-existing PG build tree? If so, that might
TL> have resulted in a partial rebuild that could create such a problem.
TL> I'd suggest "make distclean", reconfigure, rebuild before you waste
TL> any further human effort on the problem...

I did portupgrade -fa; this command rebuilds all ports. I'll try to recompile manually.

TL> A slightly different thought is that you're likely using a beta gcc
TL> release that has maybe got bugs. If decreasing the -O level helps,
TL> I'd suspect that.

gcc (GCC) 4.2.1 20070719 [FreeBSD]

I will try to compile without optimization.

Michael
I tried a clean rebuild as Tom Lane suggested, but this didn't help.

Anyone offers commercial support for solving this problem?

Michael
Michael <michael@gameservice.ru> writes:
> M> (gdb) bt
> M> #0 0x485dc277 in kill () from /lib/libc.so.7
> M> #1 0x485dc1d6 in raise () from /lib/libc.so.7
> M> #2 0x485dadda in abort () from /lib/libc.so.7
> M> #3 0x0824c075 in errfinish ()
> M> #4 0x0824c8b1 in elog_finish ()
> M> #5 0x081c9184 in s_lock ()
> M> #6 0x081c8d48 in LWLockAcquire ()
> M> #7 0x081c61ec in LockAcquire ()
> M> #8 0x081c4289 in LockRelationOid ()
> M> #9 0x080938fc in relation_open ()
> M> #10 0x08096d5a in index_open ()
> M> #11 0x08096139 in systable_beginscan ()
> M> #12 0x08134f10 in RelationBuildTriggers ()
> M> #13 0x08245d4d in RelationCacheInitializePhase2 ()
> M> #14 0x08256af0 in InitPostgres ()
> M> #15 0x081cfd13 in PostgresMain ()
> M> #16 0x081a90ec in ClosePostmasterPorts ()
> M> #17 0x081a9ea7 in PostmasterMain ()
> M> #18 0x0816912f in main ()

On closer look ... there is something awfully strange about this backtrace. If it's gotten as far as RelationBuildTriggers, then this is not the first spinlock acquisition in the life of this backend, nor the first LWLockAcquire, nor even the first time to re-acquire a previously released LWLock. Not to mention that the startup process must've successfully done such things too. That seems to eliminate all of the simple theories about how spinlocks might be broken.

How repeatable is this --- does it happen on every connection attempt, or only sometimes? Can you start and stop the postmaster without any problems being logged?

regards, tom lane
TL> How repeatable is this --- does it happen on every connection attempt,
TL> or only sometimes? Can you start and stop the postmaster without
TL> any problems being logged?

Only sometimes, 1-4 times per day under high load. Postmaster starts and stops without problems.

Backtraces are a bit different from time to time, here is last:

(gdb) bt
#0 0x485d8277 in kill () from /lib/libc.so.7
#1 0x485d81d6 in raise () from /lib/libc.so.7
#2 0x485d6dda in abort () from /lib/libc.so.7
#3 0x0824694e in errfinish ()
#4 0x08247a43 in elog_finish ()
#5 0x081c565e in s_lock ()
#6 0x081c522e in LWLockAcquire ()
#7 0x081c15ff in LockRelease ()
#8 0x081c03d3 in UnlockRelationId ()
#9 0x08096824 in index_close ()
#10 0x08095afe in systable_endscan ()
#11 0x08131a88 in RelationBuildTriggers ()
#12 0x08241bc8 in RelationCacheInitializePhase2 ()
#13 0x08251afd in InitPostgres ()
#14 0x081cc789 in PostgresMain ()
#15 0x081a5270 in ClosePostmasterPorts ()
#16 0x081a6741 in PostmasterMain ()
#17 0x081650f2 in main ()

Michael
do you have a repeatable test case?

I have a FreeBSD 7/amd64 box that I can do the following:
1) make test runs
2) make available to a developer.

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
Michael <michael@gameservice.ru> writes:
> TL> How repeatable is this --- does it happen on every connection attempt,
> TL> or only sometimes? Can you start and stop the postmaster without
> TL> any problems being logged?

> Only sometimes, 1-4 times per day under high load. Postmaster starts
> and stops without problems.

You should have been clear about that to start with, because it changes the likely nature of the problem entirely.

> Backtraces are a bit different from time to time, here is last:

Hmm, are they always within InitPostgres? That would be a bit odd, because I can't see any reason why a recently-started process would be more prone to a transient spinlock problem than any other process.

What seems like a reasonable bet at this point is that the FBSD7 kernel's scheduler has been changed in a way that makes it possible for it to sometimes not schedule a process for a very long time (order of a couple minutes). If that happened while the process was holding a spinlock then other processes waiting to get the spinlock would fail like this. Since we don't hold spinlocks long --- the maximum hold time is supposed to be no more than a couple dozen instructions --- the probability of this would be low. But under sufficient load maybe you'd see it a few times a day. (What is "high load" to you, anyway?)

A different theory, given that you said you're using a dual-core machine, is that we're seeing the effects of the two CPUs' caches somehow getting out of sync. I could believe that a kernel problem could cause that; wrong settings in the hardware page tables, for instance. Dunno if you can afford the performance hit, but it would be interesting to run for awhile with only one CPU active and see if the problem still occurs.

Anyway, I think you probably need to get some FBSD kernel hackers involved, because this sounds to me like it's their bug in one way or another. Particularly since I now notice you mentioned that FBSD7 is only at beta1 stage ...
regards, tom lane
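To put a rough number on "a very long time": a waiter that sleeps between attempts with a growing delay only declares the lock stuck after its whole sleep budget is spent. The sketch below adds up that budget for a simple doubling backoff that wraps back to the minimum once it would exceed the maximum. The function name total_backoff_msec is hypothetical, and both the doubling rule and the figures in the usage note (1 ms minimum, 1000 ms maximum, 1000 sleeps) are assumptions chosen for illustration, not constants checked against the PostgreSQL 8.2.5 source.

```c
#include <assert.h>

/*
 * Total time in milliseconds that a waiter sleeps before giving up,
 * assuming the delay starts at min_msec, doubles after every sleep,
 * and resets to min_msec whenever it would exceed max_msec.
 */
long total_backoff_msec(long min_msec, long max_msec, long num_delays)
{
    long total = 0;
    long cur = min_msec;

    assert(min_msec > 0 && max_msec >= min_msec && num_delays >= 0);
    for (long i = 0; i < num_delays; i++) {
        total += cur;           /* sleep for the current delay */
        cur *= 2;               /* back off */
        if (cur > max_msec)
            cur = min_msec;     /* wrap around and start over */
    }
    return total;
}
```

Under those assumed figures, total_backoff_msec(1, 1000, 1000) returns 102300, a bit over 100 seconds. That is consistent with the estimate in the message above: the process holding the spinlock would have to be stalled for on the order of minutes, not milliseconds, before a waiter reports it as stuck.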