Thread: LWLockAcquire problems
I posted this earlier but it didn't seem to go through. I apologize in advance if you've gotten this twice. Since I posted earlier this error happened 2 additional times. It's now a critical issue for our site:

I'm seeing the following error about once a week or so:

  2002-08-13 12:37:28 [24313] FATAL 1: LWLockAcquire: can't wait without a PROC structure

It's usually preceded by these:

  2002-08-13 12:37:28 [24313] FATAL 1: Database "template0" is not currently accepting connections

And immediately followed by this:

  2002-08-13 12:37:28 [12532] DEBUG: server process (pid 24313) exited with exit code 1
  2002-08-13 12:37:28 [12532] DEBUG: terminating any other active server processes

All active database processes then immediately do the following:

  2002-08-13 12:37:28 [24311] NOTICE: Message from PostgreSQL backend:
    The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
    I have rolled back the current transaction and am going to terminate your database system connection and exit.
    Please reconnect to the database system and repeat your query.

Ouch! Any idea what's going on here? Is the LWLockAcquire related to something like the size of the lock table or something? Any help on eliminating this error would be appreciated. Thanks!

ss

Scott Shattuck
Technical Pursuit Inc.
Scott Shattuck <ss@technicalpursuit.com> writes:
> I'm seeing the following error about once a week or so:
> 2002-08-13 12:37:28 [24313] FATAL 1: LWLockAcquire: can't wait without a PROC structure

Oh? I'd love to see what makes this happen. Can you give more context?

> It's usually preceded by these:
> 2002-08-13 12:37:28 [24313] FATAL 1: Database "template0" is not currently accepting connections

That's interesting but I'm not sure it proves much. If I try to connect to template0 here, I only see the "not currently accepting connections" message, not any LWLock complaints. So I think there's more to it...

> And immediately followed by this:
> 2002-08-13 12:37:28 [12532] DEBUG: server process (pid 24313) exited with exit code 1

That's just the effects of the FATAL 1 exit (see my comments to Tom O'Connell earlier today).

The "can't wait without a PROC" failure suggests strongly that something is rotten in very early backend startup --- once MyProc has been set, it won't happen, and *darn* little happens before MyProc gets set. But I'm not sure how to proceed beyond that observation. If you can offer any context or information at all, it'd be helpful.

regards, tom lane
On Tue, 2002-08-13 at 22:42, Tom Lane wrote:
> Scott Shattuck <ss@technicalpursuit.com> writes:
> > I'm seeing the following error about once a week or so:
> > 2002-08-13 12:37:28 [24313] FATAL 1: LWLockAcquire: can't wait without a PROC structure
>
> Oh? I'd love to see what makes this happen. Can you give more context?

I haven't been able to get any detailed correlation on what causes this over the past week and it's not happening often enough for me to turn on heavy logging to catch it a second time. The system details I can provide are:

Solaris 8 running on a 4 CPU box with 4GB main memory.
Postgres 7.2.1 built with optimization flags on and max backends at 512.

Our postgresql.conf file changes are:

  shared_buffers = 121600          # 2*max_connections, min 16
  max_fsm_relations = 512          # min 10, fsm is free space map
  max_fsm_pages = 65536            # min 1000, fsm is free space map
  max_locks_per_transaction = 256  # min 10
  wal_buffers = 1600               # min 4
  sort_mem = 4096                  # min 32
  vacuum_mem = 65536               # min 1024
  wal_files = 32                   # range 0-64

Because we're still in tuning mode we also changed:

  stats_command_string = true
  stats_row_level = true
  stats_block_level = true

Fsync is true at the moment although we're considering turning that off based on performance and what appears to be high IO overhead.

The average number of connections during normal operation is fairly low, roughly 30-50, although lock contention due to foreign key constraints can cause bottlenecks that push the connection count much higher while requests queue up waiting for locks to clear.

We run Java-based application servers that do connection pooling and these seem to be operating properly although it might be possible that some interaction between PG and the appserver connection pools may be involved here. I don't have enough understanding of the "*darn* little" that happens before MyProc gets set to say :).

Sorry I don't have more data but the activity count is high enough that logging all queries waiting for a crash to happen over a number of days can create log files that are untenable in our current environment.

Again, any insight or assistance would be greatly appreciated. This is a high-volume E-commerce application and other than this bug PG has been rock solid. Eliminating this would get our uptime record where we need it for long term comfort.

ss

Scott Shattuck
Technical Pursuit Inc.
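
For readers sizing up the configuration quoted above, the buffer settings can be turned into approximate memory figures. The following is a rough sketch, not anything from the thread: it assumes the default 8192-byte page size for shared_buffers and wal_buffers (a custom build may differ) and treats sort_mem and vacuum_mem as kilobyte counts. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* ASSUMPTION: default page size of 8192 bytes; a custom build may differ. */
#define ASSUMED_BLCKSZ 8192L

/* Convert a count of page-sized buffers (shared_buffers, wal_buffers) to MiB. */
double buffers_to_mib(long buffers)
{
    assert(buffers >= 0);
    return (double) buffers * ASSUMED_BLCKSZ / (1024.0 * 1024.0);
}

/* Convert a kilobyte-denominated setting (sort_mem, vacuum_mem) to MiB. */
double kb_to_mib(long kb)
{
    assert(kb >= 0);
    return (double) kb / 1024.0;
}

/* Print the approximate footprint of the settings quoted in the message. */
void print_memory_summary(void)
{
    const long max_backends = 512;  /* "max backends at 512" from the message */

    printf("shared_buffers: %.1f MiB\n", buffers_to_mib(121600));
    printf("wal_buffers:    %.1f MiB\n", buffers_to_mib(1600));
    printf("sort_mem:       %.1f MiB per sort\n", kb_to_mib(4096));
    printf("vacuum_mem:     %.1f MiB\n", kb_to_mib(65536));

    /* Illustrative ceiling only: one sort running in every possible backend. */
    printf("sort_mem x %ld backends: %.1f MiB\n",
           max_backends, max_backends * kb_to_mib(4096));
}
```

Calling print_memory_summary() from a small main() puts shared_buffers at about 950 MiB and wal_buffers at 12.5 MiB under those assumptions, on a machine with 4GB of main memory.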
Scott Shattuck <ss@technicalpursuit.com> writes:
> On Tue, 2002-08-13 at 22:42, Tom Lane wrote:
>> Scott Shattuck <ss@technicalpursuit.com> writes:
>>> I'm seeing the following error about once a week or so:
>>> 2002-08-13 12:37:28 [24313] FATAL 1: LWLockAcquire: can't wait without a PROC structure
>>
>> Oh? I'd love to see what makes this happen. Can you give more context?

> I haven't been able to get any detailed correlation on what causes this
> over the past week and it's not happening often enough for me to turn on
> heavy logging to catch it a second time.

What would actually be useful is a stack backtrace from the point of the error. If you are willing, I would suggest replacing the line

  elog(FATAL, "LWLockAcquire: can't wait without a PROC structure");

with

  abort();

(in 7.2 this is about line 275 of src/backend/storage/lmgr/lwlock.c) so that a core dump is forced when the error occurs. Then you could get a backtrace from the corefile.

The downside of this is that the abort() will cause a database-wide restart; I can understand if you don't want that to happen in a production system. But right at the moment I see no other way to gather more info ...

regards, tom lane
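
The effect of the suggested change can be seen in a small standalone sketch. This is not the PostgreSQL source: SketchProc, sketch_my_proc, sketch_wait_for_lock and sketch_signal_from_child are hypothetical stand-ins, and the only part taken from the message above is the idea of swapping the elog(FATAL, ...) call for abort(). The helper runs the check in a child process, so the SIGABRT that would produce the core file can be observed without killing the caller.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a backend's per-process PROC structure. */
typedef struct SketchProc { int pid; } SketchProc;

/* Hypothetical stand-in for MyProc: NULL until startup has set it. */
SketchProc *sketch_my_proc = NULL;

/*
 * Stand-in for the point where a lock acquirer would have to wait.
 * Without a PROC structure the process cannot wait, so the original
 * code raises FATAL; the patched version aborts to force a core dump.
 */
void sketch_wait_for_lock(void)
{
    if (sketch_my_proc == NULL)
    {
        /* was: elog(FATAL, "LWLockAcquire: can't wait without a PROC structure"); */
        abort();
    }
    /* The real code would queue the process and sleep here. */
}

/*
 * Run fn in a child process. Returns the signal that terminated the
 * child, 0 if it exited normally, or -1 if fork/waitpid failed.
 */
int sketch_signal_from_child(void (*fn)(void))
{
    pid_t pid;
    int status;

    assert(fn != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        fn();
        _exit(0);  /* reached only when fn() did not abort */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With sketch_my_proc left NULL, sketch_signal_from_child(sketch_wait_for_lock) returns SIGABRT; once it points at a SketchProc, the child exits normally. In the real server, a core file is only written if the core size limit (ulimit -c) allows it. The backtrace can then be read with a debugger, for example by running gdb on the postgres binary and the core file and issuing bt.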