Thread: 7.4b4 undetected deadlock?

7.4b4 undetected deadlock?

From: Robert Creager
I'm running 7.4b4 and ran into a little problem with my import and pg_autovacuum's call of VACUUM.

PostgreSQL 7.4beta4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk)

$ ./configure --with-tcl --with-openssl=/usr/include/openssl --with-CXX --enable-syslog --with-tclconfig=/usr/lib --with-perl

postgres 16542 16541  0 Oct04 ?        00:00:02 /usr/local/pgsql/bin/pg_autovacuum -d 0
postgres 20209     1  0 Oct04 pts/4    00:00:01 /usr/local/pgsql/bin/postmaster -D /var/lib/pgsql/data -D /var/lib/pgsql/data
postgres 20213 20209  0 Oct04 pts/4    00:00:10 postgres: stats buffer process
postgres 20215 20213  0 Oct04 pts/4    00:00:11 postgres: stats collector process
postgres 20231 20209  0 Oct04 pts/4    00:08:09 postgres: perl_test cpan 192.168.0.250 idle
postgres 29304 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction
postgres 29305 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle
postgres 29335 20209  0 17:37 pts/4    00:00:01 postgres: robert tassiv 192.168.0.250 REINDEX waiting
postgres 29336 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle
postgres 29549 20209  0 17:44 pts/4    00:00:00 postgres: postgres tassiv [local] VACUUM waiting

The REINDEX command is:

REINDEX TABLE temp_obs_i
or
REINDEX TABLE temp_obs_v

temp_obs_? is not a temporary table:

tassiv=# \d temp_obs_v
                     Table "public.temp_obs_v"
 Column  |  Type   |                   Modifiers
---------+---------+------------------------------------------------
 x       | real    | not null
 y       | real    | not null
 imag    | real    | not null
 smag    | real    | not null
 loc     | spoint  | not null
 obs_id  | integer | not null default nextval('"obs_id_seq"'::text)
 file_id | integer | not null
 use     | boolean | default false
 solve   | boolean | default false
 star_id | integer |
 mag     | real    |
Indexes:
    "temp_obs_v_file_id_index" btree (file_id)
    "temp_obs_v_loc_index" gist (loc)
Inherits: obs_root

tassiv=# \d obs_root
                      Table "public.obs_root"
 Column  |  Type   |                   Modifiers
---------+---------+------------------------------------------------
 x       | real    | not null
 y       | real    | not null
 imag    | real    | not null
 smag    | real    | not null
 loc     | spoint  | not null
 obs_id  | integer | not null default nextval('"obs_id_seq"'::text)
 file_id | integer | not null
 use     | boolean | default false
 solve   | boolean | default false
 star_id | integer |
 mag     | real    |
Foreign-key constraints:
    "$1" FOREIGN KEY (file_id) REFERENCES files(file_id)
    "$2" FOREIGN KEY (star_id) REFERENCES catalog(star_id)

Backtrace from pid 29549 (the waiting VACUUM):
#0  0x403279d7 in semop () from /lib/i686/libc.so.6
#1  0x081333c1 in PGSemaphoreLock ()
#2  0x081508bc in ProcSleep ()
#3  0x0814f814 in WaitOnLock ()
#4  0x0814f429 in LockAcquire ()
#5  0x0814e999 in LockRelation ()
#6  0x0807b2f9 in relation_open ()
#7  0x080e92c4 in vacuum_rel ()
#8  0x080e8be1 in vacuum ()
#9  0x08158ce5 in PortalRunUtility ()
#10 0x08158f78 in PortalRunMulti ()
#11 0x0815887e in PortalRun ()
#12 0x08155b5c in exec_simple_query ()
#13 0x08157b87 in PostgresMain ()
#14 0x0813663f in BackendFork ()
#15 0x08136108 in BackendStartup ()
#16 0x0813498d in ServerLoop ()
#17 0x08134318 in PostmasterMain ()
#18 0x0810a09c in main ()
#19 0x4025e7f7 in __libc_start_main () from /lib/i686/libc.so.6

and from pid 29335 (the waiting REINDEX):
#0  0x403279d7 in semop () from /lib/i686/libc.so.6
#1  0x081333c1 in PGSemaphoreLock ()
#2  0x081508bc in ProcSleep ()
#3  0x0814f814 in WaitOnLock ()
#4  0x0814f429 in LockAcquire ()
#5  0x0814e999 in LockRelation ()
#6  0x0807b2f9 in relation_open ()
#7  0x0807b444 in heap_open ()
#8  0x0809fff4 in reindex_relation ()
#9  0x080d2c40 in ReindexTable ()
#10 0x08158ce5 in PortalRunUtility ()
#11 0x08158f78 in PortalRunMulti ()
#12 0x0815887e in PortalRun ()
#13 0x08155b5c in exec_simple_query ()
#14 0x08157b87 in PostgresMain ()
#15 0x0813663f in BackendFork ()
#16 0x08136108 in BackendStartup ()
#17 0x0813498d in ServerLoop ()
#18 0x08134318 in PostmasterMain ()
#19 0x0810a09c in main ()
#20 0x4025e7f7 in __libc_start_main () from /lib/i686/libc.so.6

Cheers,
Rob

--
17:55:02 up 65 days, 10:24,  4 users,  load average: 2.20, 2.16, 2.57

Re: 7.4b4 undetected deadlock?

From: Tom Lane
Date: Sun, 05 Oct 2003 20:34:15 -0400
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> postgres 29304 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction

Almost certainly, that guy is holding the lock REINDEX wants.  The
VACUUM is probably backed up behind the REINDEX.  If you want to
convince me it's a deadlock, you'll need to exhibit pg_locks contents.

            regards, tom lane

Re: 7.4b4 undetected deadlock?

From: Robert Creager
When grilled further on (Sun, 05 Oct 2003 20:34:15 -0400),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

>
> Almost certainly, that guy is holding the lock REINDEX wants.  The
> VACUUM is probably backed up behind the REINDEX.  If you want to
> convince me it's a deadlock, you'll need to exhibit pg_locks contents.
>

tassiv=# select relname, pg_locks.* from pg_locks left join pg_class on ( pg_locks.relation = pg_class.oid );
      relname      | relation | database | transaction |  pid  |           mode           | granted
-------------------+----------+----------+-------------+-------+--------------------------+---------
 fits              | 50033492 | 50032754 |             | 29304 | AccessShareLock          | t
 fits              | 50033492 | 50032754 |             | 29304 | RowExclusiveLock         | t
 files             | 50033470 | 50032754 |             | 29304 | AccessShareLock          | t
 files             | 50033470 | 50032754 |             | 29304 | RowShareLock             | t
 files             | 50033470 | 50032754 |             | 29304 | RowExclusiveLock         | t
 temp_obs_v        | 50033589 | 50032754 |             | 29335 | RowExclusiveLock         | t
 files_file_id_seq | 50033468 | 50032754 |             | 29304 | AccessShareLock          | t
 groups            | 50033448 | 50032754 |             | 29335 | AccessShareLock          | t
 groups            | 50033448 | 50032754 |             | 29335 | RowShareLock             | t
 files_file_id_seq | 50033468 | 50032754 |             | 29335 | AccessShareLock          | t
 obs_id_seq        | 50033525 | 50032754 |             | 29304 | AccessShareLock          | t
 pg_locks          |    16759 | 50032754 |             | 30797 | AccessShareLock          | t
 pairs_pair_id_seq | 50033457 | 50032754 |             | 29335 | AccessShareLock          | t
 obs_id_seq        | 50033525 | 50032754 |             | 29335 | AccessShareLock          | t
 temp_obs_v        | 50033589 | 50032754 |             | 29304 | AccessShareLock          | t
 temp_obs_v        | 50033589 | 50032754 |             | 29304 | RowExclusiveLock         | t
 imported          | 50033638 | 50032754 |             | 29304 | AccessShareLock          | t
 imported          | 50033638 | 50032754 |             | 29304 | RowExclusiveLock         | t
 pairs_pair_id_seq | 50033457 | 50032754 |             | 29304 | AccessShareLock          | t
 temp_obs_i        | 50033625 | 50032754 |             | 29335 | RowExclusiveLock         | t
 temp_obs_i        | 50033625 | 50032754 |             | 29335 | AccessExclusiveLock      | f
 imported          | 50033638 | 50032754 |             | 29335 | AccessShareLock          | t
 imported          | 50033638 | 50032754 |             | 29335 | RowExclusiveLock         | t
                   | 65952259 | 50032754 |             | 29304 | AccessExclusiveLock      | t
                   |          |          |      841262 | 29549 | ExclusiveLock            | t
                   |          |          |      841339 | 30797 | ExclusiveLock            | t
 pairs             | 50033459 | 50032754 |             | 29304 | AccessShareLock          | t
 pairs             | 50033459 | 50032754 |             | 29304 | RowShareLock             | t
 pairs             | 50033459 | 50032754 |             | 29304 | RowExclusiveLock         | t
 temp_obs_i        | 50033625 | 50032754 |             | 29549 | ShareUpdateExclusiveLock | f
 pairs             | 50033459 | 50032754 |             | 29335 | AccessShareLock          | t
 pairs             | 50033459 | 50032754 |             | 29335 | RowShareLock             | t
 pairs             | 50033459 | 50032754 |             | 29335 | RowExclusiveLock         | t
 groups            | 50033448 | 50032754 |             | 29304 | AccessShareLock          | t
 groups            | 50033448 | 50032754 |             | 29304 | RowShareLock             | t
                   |          |          |      841207 | 29304 | ExclusiveLock            | t
                   |          |          |      841234 | 29335 | ExclusiveLock            | t
 files             | 50033470 | 50032754 |             | 29335 | AccessShareLock          | t
 files             | 50033470 | 50032754 |             | 29335 | RowShareLock             | t
 files             | 50033470 | 50032754 |             | 29335 | RowExclusiveLock         | t
 temp_obs_i        | 50033625 | 50032754 |             | 29304 | AccessShareLock          | t
 temp_obs_i        | 50033625 | 50032754 |             | 29304 | RowExclusiveLock         | t
 pg_class          |     1259 | 50032754 |             | 30797 | AccessShareLock          | t
 fits              | 50033492 | 50032754 |             | 29335 | AccessShareLock          | t
 fits              | 50033492 | 50032754 |             | 29335 | RowExclusiveLock         | t
(45 rows)


--
 18:41:49 up 65 days, 11:11,  4 users,  load average: 2.20, 2.06, 2.07

Re: 7.4b4 undetected deadlock?

From: Tom Lane
Date: Sun, 05 Oct 2003 23:22:45 -0400
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> tassiv=# select relname, pg_locks.* from pg_locks left join pg_class on ( pg_locks.relation = pg_class.oid );

Okay, let's chase it back.  The VACUUM (pid 29549) is blocked trying to
lock temp_obs_i for vacuuming:

>  temp_obs_i        | 50033625 | 50032754 |             | 29549 | ShareUpdateExclusiveLock | f

which is evidently blocked behind the REINDEX's lock request:

>  temp_obs_i        | 50033625 | 50032754 |             | 29335 | AccessExclusiveLock      | f

since there isn't anything else that would block it.  The REINDEX is
blocked by these locks:

>  temp_obs_i        | 50033625 | 50032754 |             | 29304 | AccessShareLock          | t
>  temp_obs_i        | 50033625 | 50032754 |             | 29304 | RowExclusiveLock         | t

so indeed everyone is waiting for this process:

postgres 29304 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction

which is not waiting for any lock but is just sitting on its duff while
holding quite a pile of locks.  Ergo, no deadlock, just badly written
client software.
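
To make the mechanism concrete: locks are held until the transaction
commits or aborts, not just until the statement finishes.  A minimal
sketch of the state 29304 is in (any statement touching temp_obs_i
would do):

BEGIN;
SELECT count(*) FROM temp_obs_i;  -- takes AccessShareLock on temp_obs_i
-- The client now goes quiet without COMMIT or ABORT.  ps shows
-- "idle in transaction", the lock stays granted, and a REINDEX of
-- temp_obs_i in another session queues behind it indefinitely.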

BTW, the 29335 process is holding rather a large pile of locks itself,
way more than I'd expect for a REINDEX alone.  It looks like you must
be running the REINDEX inside a transaction that has already touched
a number of other tables.  I'd have to counsel rethinking that ---
even though there's no deadlock in this case, you are surely risking
deadlocks.  Since REINDEX needs an exclusive lock, you should try to
minimize the number of other locks you have when you start the REINDEX.
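
For future reference, you can make pg_locks do most of that chasing
itself.  Here's a rough sketch using the same columns you already
selected; it pairs each ungranted request with every granted lock on
the same relation, and since it ignores both the lock conflict matrix
and the wait queue, read the holders it reports as candidates rather
than proof:

SELECT w.pid  AS waiting_pid,
       w.mode AS wanted_mode,
       h.pid  AS holding_pid,
       h.mode AS held_mode,
       c.relname
  FROM pg_locks w
  JOIN pg_locks h ON h.relation = w.relation
                 AND h.granted
                 AND h.pid <> w.pid
  LEFT JOIN pg_class c ON c.oid = w.relation
 WHERE NOT w.granted;

On your output it would have fingered 29304 immediately: it shows up
only as a holder, never as a waiter.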

            regards, tom lane

Re: 7.4b4 undetected deadlock?

From: Robert Creager
When grilled further on (Sun, 05 Oct 2003 23:22:45 -0400),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

>
> so indeed everyone is waiting for this process:
>
> postgres 29304 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction

Hmmm...  Not sure what that was doing.  I'm only running one client.  I had just
aborted an import, and may have left the client suspended, rather than killing
it.  Unfortunately, I lost my connection before I got your e-mail, so I cannot
check.

>
> which is not waiting for any lock but is just sitting on its duff while
> holding quite a pile of locks.  Ergo, no deadlock, just badly written
> client software.

Ok.  I'll believe that.  I jumped a little early as I did run into a real
problem a couple of months ago on 7.4b1 that you fixed, which was strikingly
similar to what I just saw.

>
> BTW, the 29335 process is holding rather a large pile of locks itself,
> way more than I'd expect for a REINDEX alone.  It looks like you must
> be running the REINDEX inside a transaction that has already touched
> a number of other tables.  I'd have to counsel rethinking that ---
> even though there's no deadlock in this case, you are surely risking
> deadlocks.  Since REINDEX needs an exclusive lock, you should try to
> minimize the number of other locks you have when you start the REINDEX.

Well, I'm inserting single records into 4 tables, updating another using
a temp table, importing ~3k records into each of two tables, then
joining those last two tables (hence the REINDEX) to get matching
entries, then selecting the results into the final table, all in one
transaction.  The indexes are required for reasonable performance.  The
data is spatially consistent, so I VACUUM on the first data set, then
REINDEX for all subsequent sets.  I'm not enough of a DB wiz to figure
out a better way of doing this, as none of the data can go in if all of
it doesn't, and all the tables being hit are FK-linked in some fashion.

I guess I could reverse the process, hitting the 3k record sets first
and then taking the join...
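
Something like this, maybe (just a sketch: COPY and the placeholder
paths stand in for whatever my import client actually does, and the one
files row still has to go in first because of the file_id FK):

BEGIN;
INSERT INTO files (...) VALUES (...);       -- parent row the FK needs
COPY temp_obs_v FROM '/path/to/obs_v.dat';  -- the two ~3k-row loads,
COPY temp_obs_i FROM '/path/to/obs_i.dat';  -- while few other locks are held
REINDEX TABLE temp_obs_v;                   -- exclusive locks taken early
REINDEX TABLE temp_obs_i;
-- then the remaining single-record inserts, the temp-table update,
-- the join of temp_obs_v/temp_obs_i, and the final INSERT ... SELECT
COMMIT;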

Thanks,
Rob

--
 21:33:12 up 65 days, 14:02,  4 users,  load average: 2.18, 2.79, 2.53

Re: 7.4b4 undetected deadlock?

From: Tom Lane
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> Tom Lane <tgl@sss.pgh.pa.us> confessed:
>> so indeed everyone is waiting for this process:
>>
>> postgres 29304 20209  0 17:37 pts/4    00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction

> Hmmm...  Not sure what that was doing.  I'm only running one client.
> I had just aborted an import, and may have left the client suspended,
> rather than killing it.

Well, the problem may just reduce to that then.

>> ... I'd have to counsel rethinking that ---
>> even though there's no deadlock in this case, you are surely risking
>> deadlocks.  Since REINDEX needs an exclusive lock, you should try to
>> minimize the number of other locks you have when you start the REINDEX.

> Well, I'm inserting single records into 4 tables, updating another
> using a temp table, importing ~3k records into two tables (each), then
> joining those last two tables (hence the REINDEX) to get matching
> entries, then selecting the results into the final table, all in one
> transaction.

This is perfectly okay as long as you're not expecting concurrent
transactions on those same tables.  If you are, then there may be a risk.
The temp table is obviously not a problem, but without detailed analysis
of access patterns to the permanent tables, it's hard to say whether
this is problematic or not.
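
For instance, two copies of the import running at once could interleave
like this (a hypothetical schedule; time runs downward, A and B are the
two sessions):

A: BEGIN;
B: BEGIN;
A: INSERT INTO temp_obs_i VALUES (...);  -- RowExclusiveLock on temp_obs_i
B: INSERT INTO temp_obs_v VALUES (...);  -- RowExclusiveLock on temp_obs_v
A: REINDEX TABLE temp_obs_v;             -- waits for B's RowExclusiveLock
B: REINDEX TABLE temp_obs_i;             -- waits for A's: deadlock

Unlike your idle-in-transaction episode, this one the deadlock detector
will catch: after deadlock_timeout, one of the two transactions is
aborted with a deadlock error.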

            regards, tom lane