Odd db lockup - investigation advice wanted - Mailing list pgsql-hackers

From Marc Munro
Subject Odd db lockup - investigation advice wanted
Date
Msg-id 1131386995.28004.24.camel@bloodnok.com
Whole thread Raw
Responses Re: Odd db lockup - investigation advice wanted
List pgsql-hackers
Last week I managed to lock-up and then crash a development database.
I'm going to try to reproduce it today and would like to know what I can
do to further investigate the problem.

I am running Linux 2.6.9-11.ELsmp X86_64 on a Quad Dual-Core Opteron

I have the following postgres RPMs installed:

postgresql-libs-7.4.7-2.RHEL4.1
postgresql-contrib-8.0.3-1PGDG
postgresql-libs-8.0.3-1PGDG
postgresql-server-8.0.3-1PGDG
postgresql-8.0.3-1PGDG

PGDATA is installed on a Netapp network storage device.

We are using slony 1.1.0 for replication.

The (provider) database locked-up after I killed a slony client process
(kill -9) on the subscriber.  Psql connections would not respond to \d
and simply locked up.  I was able to run a query to check for blocking
locks - this returned no rows.  There was a significant test load on the
database at the time.

I stopped the database but was unable to restart it.  I was unable to
kill a number of postgres processes and could not release postgres
shared memory.  Having decided that the database was toast, I discovered
that I could not even delete the database files, and eventually the only
solution was a full reboot.

This hardware was destined to be put into production in the next two
weeks but this crash has shaken our confidence somewhat.  Any advice on
how to further investigate this would be much appreciated

Here is an exerpt from the logs at the time of the failure:
LOG:  duration: 4143.996 ms  statement: execute wibble_transaction
( '2012416', '3410660', '2005-11-04 17:39:49 -0600', '20005', '3', '0',
'{22000,22011,22001,22002,22003,22004,22005,22006,22007,22008,22009,22010}', '{8,0,0,0,2,1,0,7,2,23,26,0}' )
LOG:  duration: 4814.012 ms  statement: execute write_wibble
( '2048847',

'{{15000,17660,0},{15001,3522,0},{15002,0,0},{15003,3851,0},{15004,0,0},{15005,0,0},{15006,0,0},{15007,0,0},{15011,0,0},{15012,0,0},{15013,0,0},{15014,0,0},{15015,0,0},{15016,0,0},{15017,0,0},{15018,0,0},{15019,0,0},{15020,0,0},{15021,0,0},{15022,0,0},{15023,0,0},{15024,0,0},{15025,0,0},{15026,0,0},{15048,0,0},{15030,0,0},{15031,0,0},{15032,0,0},{15033,0,0},{15035,0,0},{15036,0,0},{15037,0,0},{15038,0,0},{15040,0,0},{15041,1,0},{15042,0,0},{15043,0,0},{15046,0,0},{15047,0,0},{15049,1000,0},{15050,0,0},{15051,0,0},{15052,0,0}}'
)
LOG:  unexpected EOF on client connection
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  received immediate shutdown request
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.


My lock checking query:

select d.datname || '.' || c.relname as object,      l.transaction as trans, l.pid, l.mode,       case when b.blocked
thenb.blocker else null end as blocker 
from  (select w.pid as pid, h.pid as blocker, 't'::bool as blockedfrom   pg_locks h, pg_locks wwhere  h.granted and
notw.granted and   (   (h.relation = w.relation and h.database = w.database)        or  h.transaction = w.transaction)
    unionselect h.pid, null, 'f'::bool as blockedfrom   pg_locks h, pg_locks wwhere  h.granted and    not w.granted and
 (   (h.relation = w.relation and h.database = w.database)        or  h.transaction = w.transaction)      ) b,
pg_locksl 
left outer join pg_database d        on d.oid = l.database
left outer join  pg_class c        on c.oid = l.relation
where   l.pid = b.pid
order by l.pid;

Thanks for any suggestions.

__
Marc

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Crash during elog.c...
Next
From: "Chuck McDevitt"
Date:
Subject: Re: Another pgindent gripe