Thread: Database deadlock/hanging

Database deadlock/hanging

From
John Gateley
Date:
Hi,

My database stopped responding last night (Postgres 8.1.4).

at 2 am, a vacuum began running:
/usr/local/pgsql/bin/vacuumdb -afz

When I came in to work this morning, I could query some tables
but not others. There were many(100) processes like:
postgres 11791  6901  0 02:07 ?        00:00:00 postgres: www-data yadb2 127.0.0.1(40883) SELECT waiting

I began running an update yesterday that may or may not have completed.
I was moving data out of one table into a new table, and setting
an ID field to point to the new table. That may or may not have
completed (it was done by this morning). It is not a regular occurence
and today was the first time I saw this hanging behavior, so it's
probably related.

The postmaster was using a lot of CPU.

I stopped the postmaster with a SIGINT, it stopped quickly
and came back up automatically (I'm using daemontools) and
everything is fine.

I'm guessing it's something related to table locks.

Any pointers on what I should be looking for to prevent this from
happening again? What information I should be tracking to figure
out what is exactly happening?

Thanks very much,

j

Re: Database deadlock/hanging

From
Reece Hart
Date:
On Wed, 2007-03-07 at 10:08 -0600, John Gateley wrote:
> I'm guessing it's something related to table locks.
...
> Any pointers on what I should be looking for to prevent this from
> happening again? What information I should be tracking to figure
> out what is exactly happening?

Your inserts almost certainly have a table or index exclusively locked
and thereby causing a backlog of selects.

You can fish current and waiting locks out of pg_locks, but those use
internal identifiers rather than names.  Here's a view that will make
pg_locks more readable:

rkh@csb-dev=> CREATE OR REPLACE VIEW pgutils.locks AS
 SELECT l.pid, d.datname AS "database", n.nspname AS "schema", c.relname AS relation, l.locktype, l."mode",
        CASE l."granted"
            WHEN true THEN 'RUN'::text
            ELSE 'WAIT'::text
        END AS state, a.usename, a.current_query, to_char(now() - a.query_start, 'HH24:MI:SS'::text) AS duration
   FROM pg_locks l
   JOIN pg_database d ON l."database" = d.oid
   JOIN pg_class c ON l.relation = c.oid
   JOIN pg_namespace n ON c.relnamespace = n.oid
   JOIN pg_stat_activity a ON l.pid = a.procpid
  ORDER BY l.pid, d.datname, n.nspname, c.relname, l."granted";

rkh@csb-dev=> select * from pgutils.locks ;
  pid  | database |   schema   |      relation      | locktype |      mode       | state | usename | current_query |
duration 

-------+----------+------------+--------------------+----------+-----------------+-------+---------+---------------+----------
 28434 | csb-dev  | pg_catalog | pg_class           | relation | AccessShareLock | RUN   | rkh     | <IDLE>        |
00:00:21
 28434 | csb-dev  | pg_catalog | pg_class_oid_index | relation | AccessShareLock | RUN   | rkh     | <IDLE>        |
00:00:21
 28434 | csb-dev  | pg_catalog | pg_locks           | relation | AccessShareLock | RUN   | rkh     | <IDLE>        |
00:00:21
 28434 | csb-dev  | pg_catalog | pg_namespace       | relation | AccessShareLock | RUN   | rkh     | <IDLE>        |
00:00:21
 28434 | csb-dev  | pg_catalog | pg_stat_activity   | relation | AccessShareLock | RUN   | rkh     | <IDLE>        |
00:00:21
 28434 | csb-dev  | pgutils    | locks              | relation | AccessShareLock | RUN   | rkh     | <IDLE>        |
00:00:21
(6 rows)

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
./universe -G 6.672e-11 -e 1.602e-19 -protonmass 1.673e-27 -uspres bush
kernel warning: universe consuming too many resources. Killing.
universe killed due to catastrophic leadership. Try -uspres carter.


Re: Database deadlock/hanging

From
John Gateley
Date:
On Wed, 07 Mar 2007 09:27:04 -0800
Reece Hart <reece@harts.net> wrote:

> On Wed, 2007-03-07 at 10:08 -0600, John Gateley wrote:
> > I'm guessing it's something related to table locks.
> ...
> > Any pointers on what I should be looking for to prevent this from
> > happening again? What information I should be tracking to figure
> > out what is exactly happening?
>
> Your inserts almost certainly have a table or index exclusively locked
> and thereby causing a backlog of selects.

Thanks. It turns out it was my nightly vacuuming of the database.
I had the full option set, and I had added a large table, and I
think it was just trying to finish the vacuum (it took about
15 minutes to do a vacuumdb -az, would several hours be reasonable
for afz? Or maybe there was some deadlock with table access?)

Moral - full isn't always better (and if I RTFM, I would have
known that...)

j