Thread: LWLockAcquire problems

LWLockAcquire problems

From: Scott Shattuck
I posted this earlier but it didn't seem to go through; I apologize in
advance if you've gotten it twice. Since that post the error has
occurred two more times, so it's now a critical issue for our site:



I'm seeing the following error about once a week or so:

2002-08-13 12:37:28 [24313]  FATAL 1:  LWLockAcquire: can't wait without
a PROC structure


It's usually preceded by these:

2002-08-13 12:37:28 [24313]  FATAL 1:  Database "template0" is not
currently accepting connections


And immediately followed by this:

2002-08-13 12:37:28 [12532]  DEBUG:  server process (pid 24313) exited
with exit code 1
2002-08-13 12:37:28 [12532]  DEBUG:  terminating any other active server
processes


All active database processes then immediately do the following:

2002-08-13 12:37:28 [24311]  NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.


Ouch!

Any idea what's going on here? Is the LWLockAcquire failure related to
something like the size of the lock table? Any help on eliminating this
error would be appreciated. Thanks!


ss

Scott Shattuck
Technical Pursuit Inc.





Re: LWLockAcquire problems

From: Tom Lane
Scott Shattuck <ss@technicalpursuit.com> writes:
> I'm seeing the following error about once a week or so:
> 2002-08-13 12:37:28 [24313]  FATAL 1:  LWLockAcquire: can't wait without
> a PROC structure

Oh?  I'd love to see what makes this happen.  Can you give more context?

> It's usually preceded by these:
> 2002-08-13 12:37:28 [24313]  FATAL 1:  Database "template0" is not
> currently accepting connections

That's interesting but I'm not sure it proves much.  If I try to connect
to template0 here, I only see the "not currently accepting connections"
message, not any LWLock complaints.  So I think there's more to it...

> And immediately followed by this:
> 2002-08-13 12:37:28 [12532]  DEBUG:  server process (pid 24313) exited
> with exit code 1

That's just the effects of the FATAL 1 exit (see my comments to Tom
O'Connell earlier today).

The "can't wait without a PROC" failure suggests strongly that something
is rotten in very early backend startup --- once MyProc has been set,
it won't happen, and *darn* little happens before MyProc gets set.
But I'm not sure how to proceed beyond that observation.  If you can
offer any context or information at all, it'd be helpful.
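To make the "darn little happens before MyProc gets set" point concrete, here is a minimal sketch (simplified stand-in names, not the actual PostgreSQL source) of why the error can only fire in very early startup: a backend can sleep on a contended LWLock only via the semaphore in its PGPROC entry, and MyProc is assigned almost immediately after the backend attaches to shared memory.

```c
#include <stddef.h>

static void *MyProc = NULL;   /* stand-in for the per-backend PGPROC pointer */

/* Returns 1 if the lock could be taken, 0 if the backend would have to
 * die because it cannot wait (no PGPROC entry yet -- the FATAL case). */
static int try_acquire(int lock_is_free)
{
    if (lock_is_free)
        return 1;             /* uncontended: no need to sleep at all */
    if (MyProc == NULL)
        return 0;             /* contended + no PGPROC: FATAL in the real code */
    return 1;                 /* would sleep on MyProc's semaphore, then acquire */
}
```

Note that an uncontended acquire never needs MyProc at all, which is why the error is rare: it takes both a not-yet-initialized backend and lock contention at that exact instant.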
        regards, tom lane


Re: LWLockAcquire problems

From: Scott Shattuck
On Tue, 2002-08-13 at 22:42, Tom Lane wrote:
> Scott Shattuck <ss@technicalpursuit.com> writes:
> > I'm seeing the following error about once a week or so:
> > 2002-08-13 12:37:28 [24313]  FATAL 1:  LWLockAcquire: can't wait without
> > a PROC structure
> 
> Oh?  I'd love to see what makes this happen.  Can you give more context?

I haven't been able to get any detailed correlation on what causes this
over the past week and it's not happening often enough for me to turn on
heavy logging to catch it a second time. The system details I can
provide are:


Solaris 8 running on a 4 CPU box with 4GB main memory.
Postgres 7.2.1 built with optimization flags on and max backends at 512.

Our postgresql.conf file changes are:



shared_buffers = 121600         # 2*max_connections, min 16

max_fsm_relations = 512         # min 10, fsm is free space map

max_fsm_pages = 65536           # min 1000, fsm is free space map

max_locks_per_transaction = 256 # min 10

wal_buffers = 1600              # min 4

sort_mem = 4096                 # min 32

vacuum_mem = 65536              # min 1024

wal_files = 32                  # range 0-64



Because we're still in tuning mode we also changed:

stats_command_string = true

stats_row_level = true

stats_block_level = true


Fsync is true at the moment, although we're considering turning it off
based on performance concerns and what appears to be high I/O overhead.



The average number of connections during normal operation is fairly low,
roughly 30-50, although lock contention due to foreign key constraints
can cause bottlenecks that push the connection count much higher while
requests queue up waiting for locks to clear.

We run Java-based application servers that do connection pooling and
these seem to be operating properly although it might be possible that
some interaction between PG and the appserver connection pools may be
involved here. I don't have enough understanding of the "*darn* little"
that happens before MyProc gets set to say :).

Sorry I don't have more data, but activity is high enough that logging
all queries while waiting days for the crash to recur would produce log
files we can't manage in our current environment.

Again, any insight or assistance would be greatly appreciated. This is a
high-volume E-commerce application and other than this bug PG has been
rock solid. Eliminating this would get our uptime record where we need
it for long term comfort.


ss


Scott Shattuck
Technical Pursuit Inc.





Re: LWLockAcquire problems

From: Tom Lane
Scott Shattuck <ss@technicalpursuit.com> writes:
> On Tue, 2002-08-13 at 22:42, Tom Lane wrote:
>> Scott Shattuck <ss@technicalpursuit.com> writes:
>>> I'm seeing the following error about once a week or so:
>>> 2002-08-13 12:37:28 [24313]  FATAL 1:  LWLockAcquire: can't wait without
>>> a PROC structure
>>
>> Oh?  I'd love to see what makes this happen.  Can you give more context?

> I haven't been able to get any detailed correlation on what causes this
> over the past week and it's not happening often enough for me to turn on
> heavy logging to catch it a second time.

What would actually be useful is a stack backtrace from the point of the
error.  If you are willing, I would suggest replacing the line

	elog(FATAL, "LWLockAcquire: can't wait without a PROC structure");

with

	abort();

(in 7.2 this is about line 275 of src/backend/storage/lmgr/lwlock.c) so
that a core dump is forced when the error occurs.  Then you could get a
backtrace from the corefile.

The downside of this is that the abort() will cause a database-wide
restart; I can understand if you don't want that to happen in a
production system.  But right at the moment I see no other way to
gather more info ...
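For illustration, the patched check would look roughly like the sketch below (simplified stand-in names, not the verbatim 7.2 source): with abort() in place of the elog(FATAL, ...), hitting the bug leaves a core file that gdb can backtrace.

```c
#include <stddef.h>
#include <stdlib.h>

static void *MyProc = NULL;   /* stand-in for the per-backend PGPROC pointer */

/* Sketch of the patched hunk in LWLockAcquire: the missing-PGPROC case
 * now dumps core instead of exiting cleanly with FATAL. */
static int check_can_wait(void)
{
    if (MyProc == NULL)
        abort();  /* was: elog(FATAL, "LWLockAcquire: can't wait without a PROC structure"); */
    return 0;     /* PGPROC exists: waiting on its semaphore is possible */
}
```

The normal path (MyProc set) is unaffected; only the rare failure path changes behavior, trading a clean per-backend exit for a debuggable core dump.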
        regards, tom lane