Thread: 7.4b4 undetected deadlock?
I'm running 7.4b4 and ran into a little problem with my import and pg_autovacuum's call of VACUUM.

PostgreSQL 7.4beta4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk)

$ ./configure --with-tcl --with-openssl=/usr/include/openssl --with-CXX --enable-syslog --with-tclconfig=/usr/lib --with-perl

postgres 16542 16541  0 Oct04 ?     00:00:02 /usr/local/pgsql/bin/pg_autovacuum -d 0
postgres 20209     1  0 Oct04 pts/4 00:00:01 /usr/local/pgsql/bin/postmaster -D /var/lib/pgsql/data -D /var/lib/pgsql/data
postgres 20213 20209  0 Oct04 pts/4 00:00:10 postgres: stats buffer process
postgres 20215 20213  0 Oct04 pts/4 00:00:11 postgres: stats collector process
postgres 20231 20209  0 Oct04 pts/4 00:08:09 postgres: perl_test cpan 192.168.0.250 idle
postgres 29304 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction
postgres 29305 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv 192.168.0.250 idle
postgres 29335 20209  0 17:37 pts/4 00:00:01 postgres: robert tassiv 192.168.0.250 REINDEX waiting
postgres 29336 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv 192.168.0.250 idle
postgres 29549 20209  0 17:44 pts/4 00:00:00 postgres: postgres tassiv [local] VACUUM waiting

The reindex command is:

REINDEX TABLE temp_obs_i

or

REINDEX TABLE temp_obs_v

temp_obs_? is not a temporary table:

tassiv=# \d temp_obs_v
                      Table "public.temp_obs_v"
 Column  |  Type   |                   Modifiers
---------+---------+------------------------------------------------
 x       | real    | not null
 y       | real    | not null
 imag    | real    | not null
 smag    | real    | not null
 loc     | spoint  | not null
 obs_id  | integer | not null default nextval('"obs_id_seq"'::text)
 file_id | integer | not null
 use     | boolean | default false
 solve   | boolean | default false
 star_id | integer |
 mag     | real    |
Indexes:
    "temp_obs_v_file_id_index" btree (file_id)
    "temp_obs_v_loc_index" gist (loc)
Inherits: obs_root

tassiv=# \d obs_root
                       Table "public.obs_root"
 Column  |  Type   |                   Modifiers
---------+---------+------------------------------------------------
 x       | real    | not null
 y       | real    | not null
 imag    | real    | not null
 smag    | real    | not null
 loc     | spoint  | not null
 obs_id  | integer | not null default nextval('"obs_id_seq"'::text)
 file_id | integer | not null
 use     | boolean | default false
 solve   | boolean | default false
 star_id | integer |
 mag     | real    |
Foreign-key constraints:
    "$1" FOREIGN KEY (file_id) REFERENCES files(file_id)
    "$2" FOREIGN KEY (star_id) REFERENCES catalog(star_id)

Backtrace from 29549:

#0  0x403279d7 in semop () from /lib/i686/libc.so.6
#1  0x081333c1 in PGSemaphoreLock ()
#2  0x081508bc in ProcSleep ()
#3  0x0814f814 in WaitOnLock ()
#4  0x0814f429 in LockAcquire ()
#5  0x0814e999 in LockRelation ()
#6  0x0807b2f9 in relation_open ()
#7  0x080e92c4 in vacuum_rel ()
#8  0x080e8be1 in vacuum ()
#9  0x08158ce5 in PortalRunUtility ()
#10 0x08158f78 in PortalRunMulti ()
#11 0x0815887e in PortalRun ()
#12 0x08155b5c in exec_simple_query ()
#13 0x08157b87 in PostgresMain ()
#14 0x0813663f in BackendFork ()
#15 0x08136108 in BackendStartup ()
#16 0x0813498d in ServerLoop ()
#17 0x08134318 in PostmasterMain ()
#18 0x0810a09c in main ()
#19 0x4025e7f7 in __libc_start_main () from /lib/i686/libc.so.6

and from 29335:

#0  0x403279d7 in semop () from /lib/i686/libc.so.6
#1  0x081333c1 in PGSemaphoreLock ()
#2  0x081508bc in ProcSleep ()
#3  0x0814f814 in WaitOnLock ()
#4  0x0814f429 in LockAcquire ()
#5  0x0814e999 in LockRelation ()
#6  0x0807b2f9 in relation_open ()
#7  0x0807b444 in heap_open ()
#8  0x0809fff4 in reindex_relation ()
#9  0x080d2c40 in ReindexTable ()
#10 0x08158ce5 in PortalRunUtility ()
#11 0x08158f78 in PortalRunMulti ()
#12 0x0815887e in PortalRun ()
#13 0x08155b5c in exec_simple_query ()
#14 0x08157b87 in PostgresMain ()
#15 0x0813663f in BackendFork ()
#16 0x08136108 in BackendStartup ()
#17 0x0813498d in ServerLoop ()
#18 0x08134318 in PostmasterMain ()
#19 0x0810a09c in main ()
#20 0x4025e7f7 in __libc_start_main () from /lib/i686/libc.so.6

Cheers,
Rob

--
17:55:02 up 65 days, 10:24, 4 users, load average: 2.20, 2.16, 2.57
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> postgres 29304 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv
> 192.168.0.250 idle in transaction

Almost certainly, that guy is holding the lock REINDEX wants.  The VACUUM is probably backed up behind the REINDEX.  If you want to convince me it's a deadlock, you'll need to exhibit pg_locks contents.

			regards, tom lane
When grilled further on (Sun, 05 Oct 2003 20:34:15 -0400), Tom Lane <tgl@sss.pgh.pa.us> confessed:

> Almost certainly, that guy is holding the lock REINDEX wants.  The
> VACUUM is probably backed up behind the REINDEX.  If you want to
> convince me it's a deadlock, you'll need to exhibit pg_locks contents.

tassiv=# select relname, pg_locks.* from pg_locks left join pg_class on
( pg_locks.relation = pg_class.oid );

      relname      | relation | database | transaction |  pid  |           mode           | granted
-------------------+----------+----------+-------------+-------+--------------------------+---------
 fits              | 50033492 | 50032754 |             | 29304 | AccessShareLock          | t
 fits              | 50033492 | 50032754 |             | 29304 | RowExclusiveLock         | t
 files             | 50033470 | 50032754 |             | 29304 | AccessShareLock          | t
 files             | 50033470 | 50032754 |             | 29304 | RowShareLock             | t
 files             | 50033470 | 50032754 |             | 29304 | RowExclusiveLock         | t
 temp_obs_v        | 50033589 | 50032754 |             | 29335 | RowExclusiveLock         | t
 files_file_id_seq | 50033468 | 50032754 |             | 29304 | AccessShareLock          | t
 groups            | 50033448 | 50032754 |             | 29335 | AccessShareLock          | t
 groups            | 50033448 | 50032754 |             | 29335 | RowShareLock             | t
 files_file_id_seq | 50033468 | 50032754 |             | 29335 | AccessShareLock          | t
 obs_id_seq        | 50033525 | 50032754 |             | 29304 | AccessShareLock          | t
 pg_locks          |    16759 | 50032754 |             | 30797 | AccessShareLock          | t
 pairs_pair_id_seq | 50033457 | 50032754 |             | 29335 | AccessShareLock          | t
 obs_id_seq        | 50033525 | 50032754 |             | 29335 | AccessShareLock          | t
 temp_obs_v        | 50033589 | 50032754 |             | 29304 | AccessShareLock          | t
 temp_obs_v        | 50033589 | 50032754 |             | 29304 | RowExclusiveLock         | t
 imported          | 50033638 | 50032754 |             | 29304 | AccessShareLock          | t
 imported          | 50033638 | 50032754 |             | 29304 | RowExclusiveLock         | t
 pairs_pair_id_seq | 50033457 | 50032754 |             | 29304 | AccessShareLock          | t
 temp_obs_i        | 50033625 | 50032754 |             | 29335 | RowExclusiveLock         | t
 temp_obs_i        | 50033625 | 50032754 |             | 29335 | AccessExclusiveLock      | f
 imported          | 50033638 | 50032754 |             | 29335 | AccessShareLock          | t
 imported          | 50033638 | 50032754 |             | 29335 | RowExclusiveLock         | t
                   | 65952259 | 50032754 |             | 29304 | AccessExclusiveLock      | t
                   |          |          |      841262 | 29549 | ExclusiveLock            | t
                   |          |          |      841339 | 30797 | ExclusiveLock            | t
 pairs             | 50033459 | 50032754 |             | 29304 | AccessShareLock          | t
 pairs             | 50033459 | 50032754 |             | 29304 | RowShareLock             | t
 pairs             | 50033459 | 50032754 |             | 29304 | RowExclusiveLock         | t
 temp_obs_i        | 50033625 | 50032754 |             | 29549 | ShareUpdateExclusiveLock | f
 pairs             | 50033459 | 50032754 |             | 29335 | AccessShareLock          | t
 pairs             | 50033459 | 50032754 |             | 29335 | RowShareLock             | t
 pairs             | 50033459 | 50032754 |             | 29335 | RowExclusiveLock         | t
 groups            | 50033448 | 50032754 |             | 29304 | AccessShareLock          | t
 groups            | 50033448 | 50032754 |             | 29304 | RowShareLock             | t
                   |          |          |      841207 | 29304 | ExclusiveLock            | t
                   |          |          |      841234 | 29335 | ExclusiveLock            | t
 files             | 50033470 | 50032754 |             | 29335 | AccessShareLock          | t
 files             | 50033470 | 50032754 |             | 29335 | RowShareLock             | t
 files             | 50033470 | 50032754 |             | 29335 | RowExclusiveLock         | t
 temp_obs_i        | 50033625 | 50032754 |             | 29304 | AccessShareLock          | t
 temp_obs_i        | 50033625 | 50032754 |             | 29304 | RowExclusiveLock         | t
 pg_class          |     1259 | 50032754 |             | 30797 | AccessShareLock          | t
 fits              | 50033492 | 50032754 |             | 29335 | AccessShareLock          | t
 fits              | 50033492 | 50032754 |             | 29335 | RowExclusiveLock         | t
(45 rows)

--
18:41:49 up 65 days, 11:11, 4 users, load average: 2.20, 2.06, 2.07
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> tassiv=# select relname, pg_locks.* from pg_locks left join pg_class on
> ( pg_locks.relation = pg_class.oid );

Okay, let's chase it back.  The VACUUM (pid 29549) is blocked trying to lock temp_obs_i for vacuuming:

> temp_obs_i | 50033625 | 50032754 |  | 29549 | ShareUpdateExclusiveLock | f

which is evidently blocked behind the REINDEX's lock request:

> temp_obs_i | 50033625 | 50032754 |  | 29335 | AccessExclusiveLock | f

since there isn't anything else that would block it.  The REINDEX is blocked by these locks:

> temp_obs_i | 50033625 | 50032754 |  | 29304 | AccessShareLock  | t
> temp_obs_i | 50033625 | 50032754 |  | 29304 | RowExclusiveLock | t

so indeed everyone is waiting for this process:

postgres 29304 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv 192.168.0.250 idle in transaction

which is not waiting for any lock but is just sitting on its duff while holding quite a pile of locks.  Ergo, no deadlock, just badly written client software.

BTW, the 29335 process is holding rather a large pile of locks itself, way more than I'd expect for a REINDEX alone.  It looks like you must be running the REINDEX inside a transaction that has already touched a number of other tables.  I'd have to counsel rethinking that --- even though there's no deadlock in this case, you are surely risking deadlocks.  Since REINDEX needs an exclusive lock, you should try to minimize the number of other locks you have when you start the REINDEX.

			regards, tom lane
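[The chase done by hand above can also be approximated with one query against pg_locks. This is a rough sketch, not a query from the thread: it pairs each ungranted lock request with every granted lock some other backend holds on the same relation, without checking whether the modes actually conflict.]

select w.pid  as waiting_pid,
       w.mode as waiting_mode,
       h.pid  as holding_pid,
       h.mode as holding_mode,
       c.relname
  from pg_locks w
  join pg_locks h
    on h.relation = w.relation     -- same table; transaction-id locks drop out here
   and h.database = w.database
   and h.pid <> w.pid              -- held by some other backend
   and h.granted
  left join pg_class c on c.oid = w.relation
 where not w.granted;              -- only the requests still waiting

[Against the listing above, this would show 29335 waiting behind 29304's locks on temp_obs_i, and 29549 waiting behind both 29335 and 29304.]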
When grilled further on (Sun, 05 Oct 2003 23:22:45 -0400), Tom Lane <tgl@sss.pgh.pa.us> confessed:

> so indeed everyone is waiting for this process:
>
> postgres 29304 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv
> 192.168.0.250 idle in transaction

Hmmm... Not sure what that was doing.  I'm only running one client.  I had just aborted an import, and may have left the client suspended rather than killing it.  Unfortunately, I lost my connection before I got your e-mail, so I cannot check.

> which is not waiting for any lock but is just sitting on its duff while
> holding quite a pile of locks.  Ergo, no deadlock, just badly written
> client software.

Ok, I'll believe that.  I jumped a little early, as I ran into a real problem a couple of months ago on 7.4b1 (which you fixed) that was strikingly similar to what I just saw.

> BTW, the 29335 process is holding rather a large pile of locks itself,
> way more than I'd expect for a REINDEX alone.  It looks like you must
> be running the REINDEX inside a transaction that has already touched
> a number of other tables.  I'd have to counsel rethinking that ---
> even though there's no deadlock in this case, you are surely risking
> deadlocks.  Since REINDEX needs an exclusive lock, you should try to
> minimize the number of other locks you have when you start the REINDEX.

Well, I'm inserting single records into 4 tables, updating another using a temp table, importing ~3k records into two tables (each), then joining those last two tables (hence the REINDEX) to get matching entries, then selecting the results into the final table, all in one transaction.  The indexes are required for reasonable performance.  The data is spatially consistent, so I VACUUM on the first data set, then REINDEX all subsequent sets.  I'm not enough of a dB wiz to figure out a better way of doing this, as none of the data can go in if all of it doesn't, and all the tables being hit are FK linked in some fashion.  I guess I could reverse the process, which would hit the 3k record sets first, and then take the join...

Thanks,
Rob

--
21:33:12 up 65 days, 14:02, 4 users, load average: 2.18, 2.79, 2.53
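[For concreteness, the single-transaction import described above might be sketched roughly as follows. This is a hypothetical outline: the column lists, file paths, target columns, use of COPY, and the join condition are all assumptions; only the table names, the REINDEX step, and the overall shape come from the thread.]

BEGIN;

-- single-record inserts into the bookkeeping tables (columns simplified)
INSERT INTO files (file_id) VALUES (nextval('files_file_id_seq'));

-- bulk-load ~3k observations into the two staging tables
COPY temp_obs_v (x, y, imag, smag, loc, file_id) FROM '/tmp/obs_v.dat';
COPY temp_obs_i (x, y, imag, smag, loc, file_id) FROM '/tmp/obs_i.dat';

-- rebuild the indexes so the join below performs acceptably
REINDEX TABLE temp_obs_v;
REINDEX TABLE temp_obs_i;

-- join the two staging tables and push the matches into the final table
-- (target columns and match condition are placeholders; the real match is
-- presumably spatial, via the gist index on loc)
INSERT INTO imported (obs_v_id, obs_i_id)
SELECT v.obs_id, i.obs_id
  FROM temp_obs_v v
  JOIN temp_obs_i i ON v.file_id = i.file_id;

COMMIT;

[Since REINDEX takes an AccessExclusiveLock, everything touched earlier in the transaction is still locked when it runs, which is the pile of locks noted above for pid 29335.]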
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> Tom Lane <tgl@sss.pgh.pa.us> confessed:
>> so indeed everyone is waiting for this process:
>>
>> postgres 29304 20209  0 17:37 pts/4 00:00:00 postgres: robert tassiv
>> 192.168.0.250 idle in transaction

> Hmmm... Not sure what that was doing.  I'm only running one client.
> I had just aborted an import, and may have left the client suspended
> rather than killing it.

Well, the problem may just reduce to that then.

>> ... I'd have to counsel rethinking that ---
>> even though there's no deadlock in this case, you are surely risking
>> deadlocks.  Since REINDEX needs an exclusive lock, you should try to
>> minimize the number of other locks you have when you start the REINDEX.

> Well, I'm inserting single records into 4 tables, updating another
> using a temp table, importing ~3k records into two tables (each), then
> joining those last two tables (hence the REINDEX) to get matching
> entries, then selecting the results into the final table, all in one
> transaction.

This is perfectly okay as long as you're not expecting concurrent transactions on those same tables.  If you are, then there may be a risk.  The temp table is obviously not a problem, but without detailed analysis of access patterns to the permanent tables, it's hard to say whether this is problematic or not.

			regards, tom lane