Thread: What else could I have done? COPY to unlogged tbl "hung"/locked the table
Hi, I'm relatively new to Pg.

Question: What could have happened to cause the COPY to an unlogged table to hang? And what would dropping and recreating the unlogged table have "released" that a restart of the cluster did not? Is there anywhere else I could have looked, or anything else I could have done, other than recreating the unlogged table and/or restarting the cluster?

Timeline:

The day before, the cluster host had to be rebooted since it was unresponsive, perhaps due to "a massive loop in vi re substitution" (as the developer put it).

Yesterday, a COPY to an unlogged table through Java was hung. Usually it is instantaneous, but it had been hung for two hours before they contacted me. pg_locks showed me the following:

  pid  |   locktype    | dbase  |           table            | virtualtransaction | virtualxid |       mode       | granted
 26554 | relation      | dbname | pg_class                   | 8/13               |            | AccessShareLock  | t
 26554 | relation      | dbname | pg_class_oid_index         | 8/13               |            | AccessShareLock  | t
 26554 | relation      | dbname | pg_class_relname_nsp_index | 8/13               |            | AccessShareLock  | t
 26554 | relation      |        | pg_database                | 8/13               |            | AccessShareLock  | t
 26554 | relation      |        | pg_database_datname_index  | 8/13               |            | AccessShareLock  | t
 26554 | relation      |        | pg_database_oid_index      | 8/13               |            | AccessShareLock  | t
 26554 | relation      | dbname | pg_locks                   | 8/13               |            | AccessShareLock  | t
 26554 | virtualxid    |        |                            | 8/13               | 8/13       | ExclusiveLock    | t
 27288 | relation      | dbname | unlogged_tbl_idx           | 11/33              |            | RowExclusiveLock | t
 27288 | relation      | dbname | unlogged_tbl               | 11/33              |            | RowExclusiveLock | t
 27288 | relation      | dbname | unlogged_tbl_key           | 11/33              |            | RowExclusiveLock | t
 27288 | relation      | dbname | unlogged_tbl_pkey          | 11/33              |            | RowExclusiveLock | t
 27288 | transactionid |        |                            | 11/33              |            | ExclusiveLock    | t
 27288 | virtualxid    |        |                            | 11/33              | 11/33      | ExclusiveLock    | t

When the developer killed the shell script from his end, it didn't release the locks. Neither pg_cancel_backend() nor pg_terminate_backend() of the COPY backend worked, so finally a kill -9 of the COPY OS process forced a crash-restart of the whole cluster:

2012-08-29 14:29:58 EDT [8119]: [2-1] LOG:  server process (PID 30468) was terminated by signal 9: Killed
2012-08-29 14:29:58 EDT [8119]: [3-1] LOG:  terminating any other active server processes
2012-08-29 14:29:58 EDT [25798]: [3-1] WARNING:  terminating connection because of crash of another server process
2012-08-29 14:29:58 EDT [25798]: [4-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-08-29 14:29:58 EDT [25798]: [5-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

The load script was kicked off again and it hung again. A fast stop of the server failed, so an immediate stop had to be executed:

2012-08-29 14:45:44 EDT [8119]: [9-1] LOG:  received immediate shutdown request
2012-08-29 14:45:44 EDT [27288]: [1-1] WARNING:  terminating connection because of crash of another server process
2012-08-29 14:45:44 EDT [27288]: [2-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-08-29 14:45:44 EDT [27288]: [3-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2012-08-29 14:45:44 EDT [27288]: [4-1] CONTEXT:  COPY unlogged_tbl, line 1: "50005001|<abacus><address>5291 math drive</address><address2/><alt_names/><city>atlantis..."
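For what it's worth, a quick way to spot ordinary heavyweight-lock waits is to look for ungranted entries in pg_locks and join them back to whoever holds a conflicting lock on the same relation. A rough sketch (not the exact query used for the listing above; note that in this incident every lock showed granted = t, which is part of the puzzle):

    -- Find relation locks that are waiting, and the sessions holding
    -- conflicting locks on the same relation.
    SELECT w.pid                 AS waiting_pid,
           w.mode                AS wanted_mode,
           h.pid                 AS holding_pid,
           h.mode                AS held_mode,
           w.relation::regclass  AS relation
      FROM pg_locks w
      JOIN pg_locks h
        ON h.relation = w.relation
       AND h.granted
       AND h.pid <> w.pid
     WHERE w.locktype = 'relation'
       AND NOT w.granted;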
The developer tried executing using only the Java class, not through a shell script. The COPY was kicked off and it hung again.

I created another table using

    db=# create table tmp_tbl as select * from unlogged_tbl;

and a COPY into this tmp_tbl from the same file was instantaneous. However, I later noticed that this creates a logged table, NOT an unlogged table (as I had expected :) ).

I dropped and recreated the unlogged tables, and that seems to have worked. No more hanging of the load script, and so far so good.
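For the record, CREATE TABLE AS always produces a regular (logged) table; to actually get an unlogged copy the keyword has to be spelled out. A minimal sketch, using the table names from this thread:

    -- Structure only, copying indexes, constraints and defaults:
    CREATE UNLOGGED TABLE tmp_tbl (LIKE unlogged_tbl INCLUDING ALL);

    -- Or structure plus data (CTAS is logged unless UNLOGGED is given):
    CREATE UNLOGGED TABLE tmp_tbl2 AS SELECT * FROM unlogged_tbl;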
In addition, here is a backtrace of the process while it hung again today (thanks to our SA, Eric):

#0  0x000000396c8d4627 in semop () from /lib64/libc.so.6
#1  0x00000000005dc313 in PGSemaphoreLock ()
#2  0x0000000000616586 in LWLockAcquire ()
#3  0x00000000004a718c in ginHeapTupleFastInsert ()
#4  0x000000000049c808 in gininsert ()
#5  0x00000000006dd968 in FunctionCall6Coll ()
#6  0x0000000000474b07 in index_insert ()
#7  0x00000000005664c8 in ExecInsertIndexTuples ()
#8  0x0000000000511fa2 in DoCopy ()
#9  0x0000000000626f71 in standard_ProcessUtility ()
#10 0x0000000000624299 in PostgresMain ()
#11 0x0000000000625309 in CreateQueryDesc ()
#12 0x0000000000625d35 in PortalRun ()
#13 0x0000000000623262 in PostgresMain ()
#14 0x00000000005e7ae4 in ClosePostmasterPorts ()
#15 0x00000000005e8861 in PostmasterMain ()
#16 0x000000000058c5be in main ()
amacvar <kahitarich-postgresForum@yahoo.com> writes:
> In addition here is a backtrace of the process while it hung again today:
> (thanks to our SA Eric)
> #0 0x000000396c8d4627 in semop () from /lib64/libc.so.6
> #1 0x00000000005dc313 in PGSemaphoreLock ()
> #2 0x0000000000616586 in LWLockAcquire ()
> #3 0x00000000004a718c in ginHeapTupleFastInsert ()
> #4 0x000000000049c808 in gininsert ()

Interesting. The most plausible explanation I can see for this stack trace is that ginHeapTupleFastInsert is blocked trying to get buffer lock on one of the GIN index pages it locks directly --- either the metapage or the tail page of the fast-insert list. That would suggest a deadlock condition, that is, somebody else trying to get the same two locks in the other order. (One of the ways in which the LWLock mechanism is lightweight is that it has no deadlock detection, so if this is what's happening, the deadlock would persist until you kill the processes.)

However, if that's the case then there must be another process that is also blocked. Next time it happens, could you look around at *all* the postgres processes and collect stack traces from any that don't seem to be making progress? It doesn't appear to me that ginHeapTupleFastInsert itself is in the wrong here, so we need to find out what the conflicting process is doing.

Also, as a temporary workaround, you might be able to dodge the problem by disabling the FASTUPDATE storage parameter for your GIN indexes. That won't help find the bug, though, so if you could refrain from doing that until we've identified the cause, it'd be appreciated.

			regards, tom lane
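For concreteness, the workaround Tom describes would look something like this (the index name is assumed here to be the unlogged_tbl_idx seen in the earlier pg_locks output; fastupdate is the GIN storage parameter in question):

    -- Stop queueing new entries in the GIN fast-update pending list:
    ALTER INDEX unlogged_tbl_idx SET (fastupdate = off);

    -- Entries already queued in the pending list are only merged into
    -- the main index structure by a later VACUUM of the table:
    VACUUM unlogged_tbl;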
Thank you, Tom, for the suggestion and the workaround. I think we will be able to hold off on the temporary workaround.

Would the 2nd process in the deadlock show up in pg_locks? (It didn't.) An INSERT from another table that I created to test the datafile did show up as blocked. I got confirmation from the developers that no other process was running at the time. Also, even after a restart, the COPY to the unlogged table hung until the table was dropped and recreated.

After the latest occurrence, we took some downtime and I have rebuilt all the unlogged tables; no recurrences yet.
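If it recurs, a quick way to enumerate every backend so a stack trace can be pulled from each, as Tom asked --- a sketch, assuming a 9.1-era server where the pg_stat_activity PID column is procpid and the query column is current_query (they become pid and query in 9.2):

    -- List every backend so each PID can be inspected with gdb/pstack:
    SELECT procpid, datname, usename, current_query
      FROM pg_stat_activity;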