Thread: autovacuum locks
Folks,

I've got a system here (8.3.7) that is locked up. A few queries are waiting on autovacuum-acquired locks on a table or two, but it looks like autovacuum is itself waiting on a semaphore:

#0  0x00f07410 in __kernel_vsyscall ()
#1  0x00252d2b in semop () from /lib/libc.so.6
#2  0x081ca1ce in PGSemaphoreLock ()
#3  0x081f590e in ProcWaitForSignal ()
#4  0x081e8dea in LockBufferForCleanup ()
#5  0x08160e68 in ?? ()
#6  0x08161482 in lazy_vacuum_rel ()
#7  0x081603b5 in ?? ()
#8  0x081605c4 in vacuum ()
#9  0x081cd154 in ?? ()
#10 0x081cd7f1 in ?? ()
#11 0x081cd911 in StartAutoVacWorker ()
#12 0x081d6d47 in ?? ()
#13 <signal handler called>
#14 0x00f07410 in __kernel_vsyscall ()
#15 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
#16 0x081d3964 in ?? ()
#17 0x081d5105 in PostmasterMain ()
#18 0x0818be60 in main ()

There are still other transactions going on, and they are working fine - but two tables seem to be cut off by autovacuum, and other queries are stuck waiting for autovacuum to finish.

Any ideas? Is there a fix up the 8.3 stream that would target this? (I need leverage to upgrade to 8.3.18, and if that's the case - happy days.)

--
GJ
Excerpts from Gregg Jaskiewicz's message of vie mar 02 07:44:07 -0300 2012:

> Folks,
>
> I've got a system here (8.3.7) that is locked up. A few queries are
> waiting on autovacuum-acquired locks on a table or two, but it looks
> like autovacuum is itself waiting on a semaphore:
>
> #0  0x00f07410 in __kernel_vsyscall ()
> #1  0x00252d2b in semop () from /lib/libc.so.6
> #2  0x081ca1ce in PGSemaphoreLock ()
> #3  0x081f590e in ProcWaitForSignal ()
> #4  0x081e8dea in LockBufferForCleanup ()
> #5  0x08160e68 in ?? ()
> #6  0x08161482 in lazy_vacuum_rel ()
> #7  0x081603b5 in ?? ()
> #8  0x081605c4 in vacuum ()
> #9  0x081cd154 in ?? ()
> #10 0x081cd7f1 in ?? ()
> #11 0x081cd911 in StartAutoVacWorker ()
> #12 0x081d6d47 in ?? ()
> #13 <signal handler called>
> #14 0x00f07410 in __kernel_vsyscall ()
> #15 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
> #16 0x081d3964 in ?? ()
> #17 0x081d5105 in PostmasterMain ()
> #18 0x0818be60 in main ()
>
> There are still other transactions going on, and they are working fine -
> but two tables seem to be cut off by autovacuum, and other queries are
> stuck waiting for autovacuum to finish.

This vacuum is waiting for some other process to release the buffer
lock. These locks are usually very short-lived, but maybe you have some
other backend doing something weird that won't release it. Do you have
idle processes with transactions open, and if you have, what are they
doing? Do you have prepared transactions open? (See the
pg_prepared_xacts view.)

> Any ideas?
> Is there a fix up the 8.3 stream that would target this?

I don't think so, though I am not certain.

> (I need leverage to upgrade to 8.3.18, and if that's the case - happy
> days.)

Sorry, can't help you there :-)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
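Both of those checks can be run from SQL. A minimal sketch, assuming the 8.3-era pg_stat_activity column names (procpid, current_query, xact_start):

-- Backends sitting idle inside an open transaction
SELECT procpid, usename, xact_start, now() - xact_start AS open_for
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
ORDER BY xact_start;

-- Transactions prepared for two-phase commit but not yet committed or rolled back
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;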
On 2 March 2012 11:03, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>
> Excerpts from Gregg Jaskiewicz's message of vie mar 02 07:44:07 -0300 2012:
>> Folks,
>>
>> I've got a system here (8.3.7) that is locked up. A few queries are
>> waiting on autovacuum-acquired locks on a table or two, but it looks
>> like autovacuum is itself waiting on a semaphore:
>>
>> #0  0x00f07410 in __kernel_vsyscall ()
>> #1  0x00252d2b in semop () from /lib/libc.so.6
>> #2  0x081ca1ce in PGSemaphoreLock ()
>> #3  0x081f590e in ProcWaitForSignal ()
>> #4  0x081e8dea in LockBufferForCleanup ()
>> #5  0x08160e68 in ?? ()
>> #6  0x08161482 in lazy_vacuum_rel ()
>> #7  0x081603b5 in ?? ()
>> #8  0x081605c4 in vacuum ()
>> #9  0x081cd154 in ?? ()
>> #10 0x081cd7f1 in ?? ()
>> #11 0x081cd911 in StartAutoVacWorker ()
>> #12 0x081d6d47 in ?? ()
>> #13 <signal handler called>
>> #14 0x00f07410 in __kernel_vsyscall ()
>> #15 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
>> #16 0x081d3964 in ?? ()
>> #17 0x081d5105 in PostmasterMain ()
>> #18 0x0818be60 in main ()
>>
>> There are still other transactions going on, and they are working fine -
>> but two tables seem to be cut off by autovacuum, and other queries are
>> stuck waiting for autovacuum to finish.
>
> This vacuum is waiting for some other process to release the buffer
> lock. These locks are usually very short-lived, but maybe you have some
> other backend doing something weird that won't release it. Do you have
> idle processes with transactions open, and if you have, what are they
> doing? Do you have prepared transactions open? (See the
> pg_prepared_xacts view.)

That view is empty, but I know that the queries that are waiting for
the locks to be released are in the middle of a two-phase commit
transaction.

It could be that the system is doing something, but it is just so slow
that those locks are starved - and no longer so short-lived :)

Is there any particular info I could gather to make it easier for you
guys to look at?
Here's a backtrace from one of the autovacuums (this time with line number info):

#0  0x00f07410 in __kernel_vsyscall ()
#1  0x00252d2b in semop () from /lib/libc.so.6
#2  0x081ca1ce in PGSemaphoreLock (sema=0xb792bdf8, interruptOK=1 '\001') at pg_sema.c:420
#3  0x081f590e in ProcWaitForSignal () at proc.c:1269
#4  0x081e8dea in LockBufferForCleanup (buffer=24516) at bufmgr.c:2304
#5  0x08160e68 in lazy_vacuum_heap (onerel=0x87573c0, vacrelstats=0x86a4768) at vacuumlazy.c:655
#6  0x08161482 in lazy_scan_heap (onerel=0x87573c0, vacstmt=0xbfbcfb50, bstrategy=0x86fbe98) at vacuumlazy.c:591
#7  lazy_vacuum_rel (onerel=0x87573c0, vacstmt=0xbfbcfb50, bstrategy=0x86fbe98) at vacuumlazy.c:191
#8  0x081603b5 in vacuum_rel (relid=<value optimized out>, vacstmt=0xbfbcfb50, expected_relkind=114 'r', for_wraparound=0 '\000') at vacuum.c:1126
#9  0x081605c4 in vacuum (vacstmt=0xbfbcfb50, relids=0x86fd310, bstrategy=0x86fbe98, for_wraparound=0 '\000', isTopLevel=1 '\001') at vacuum.c:427
#10 0x081cd154 in do_autovacuum () at autovacuum.c:2638
#11 0x081cd7f1 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1613
#12 0x081cd911 in StartAutoVacWorker () at autovacuum.c:1416
#13 0x081d6d47 in StartAutovacuumWorker (postgres_signal_arg=10) at postmaster.c:4109
#14 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:3845
#15 <signal handler called>
#16 0x00f07410 in __kernel_vsyscall ()
#17 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
#18 0x081d3964 in ServerLoop () at postmaster.c:1234
#19 0x081d5105 in PostmasterMain (argc=5, argv=0x8641828) at postmaster.c:1029
#20 0x0818be60 in main (argc=5, argv=0x0) at main.c:188

And here's the connection that is waiting for the locks to be released (the locked-up connection). That query has been hanging there since ~18:00 BST:

#0  0x00f07410 in __kernel_vsyscall ()
#1  0x00252d2b in semop () from /lib/libc.so.6
#2  0x081ca1ce in PGSemaphoreLock (sema=0xb7929618, interruptOK=1 '\001') at pg_sema.c:420
#3  0x081f5c53 in ProcSleep (locallock=0x86578f0, lockMethodTable=0x834e0f4) at proc.c:880
#4  0x081f4a5d in WaitOnLock (locallock=0x86578f0, owner=0x88676d8) at lock.c:1147
#5  0x081f50af in LockAcquire (locktag=0xbfbcf2b8, lockmode=7, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:792
#6  0x081f27f9 in LockTuple (relation=0xa72f53a8, tid=0xbfbcf334, lockmode=7) at lmgr.c:378
#7  0x0809b19c in heap_update (relation=0xa72f53a8, otid=0xbfbcf3da, newtup=0x86d3be0, ctid=0xbfbcf3d4, update_xmax=0xbfbcf3e0, cid=2, crosscheck=0x0, wait=1 '\001') at heapam.c:2376
#8  0x08167f8b in ExecutePlan (queryDesc=0x870d010, direction=ForwardScanDirection, count=0) at execMain.c:1880
#9  ExecutorRun (queryDesc=0x870d010, direction=ForwardScanDirection, count=0) at execMain.c:270
#10 0x0818051e in _SPI_execute_plan (plan=0x87b8948, Values=0x8e960f8, Nulls=0x8e960e8 " ", snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=1 '\001', tcount=0) at spi.c:1845
#11 0x08180796 in SPI_execute_plan (plan=0x87b8948, Values=0x8e960f8, Nulls=0x8e960e8 " ", read_only=0 '\000', tcount=0) at spi.c:374
#12 0x0012d488 in exec_stmt_execsql (estate=0xbfbcf7bc, stmt=0x87a2630) at pl_exec.c:2546
#13 0x0012f0cc in exec_stmt (estate=0xbfbcf7bc, stmts=<value optimized out>) at pl_exec.c:1260
#14 exec_stmts (estate=0xbfbcf7bc, stmts=<value optimized out>) at pl_exec.c:1171
#15 0x00130d19 in exec_stmt_block (estate=0xbfbcf7bc, block=0x87a2778) at pl_exec.c:1114
#16 0x001310e7 in plpgsql_exec_trigger (func=0x8844ae0, trigdata=0xbfbcfb78) at pl_exec.c:636
#17 0x00127082 in plpgsql_call_handler (fcinfo=0xbfbcf934) at pl_handler.c:92
#18 0x0814a63a in ExecCallTriggerFunc (trigdata=0xbfbcfb78, tgindx=8, finfo=0x12800b, instr=0x0, per_tuple_context=0x87f5fd8) at trigger.c:1577
#19 0x0814b69b in ExecBRUpdateTriggers (estate=0x885bd28, relinfo=0x885bdb0, tupleid=0xbfbcfc2a, newtuple=0x8899310) at trigger.c:2000
#20 0x0816820e in ExecutePlan (queryDesc=0x8ab81e8, direction=ForwardScanDirection, count=0) at execMain.c:1835
#21 ExecutorRun (queryDesc=0x8ab81e8, direction=ForwardScanDirection, count=0) at execMain.c:270
#22 0x08203054 in ProcessQuery (plan=0x8ab7f18, params=<value optimized out>, dest=0x8ab81a0, completionTag=0xbfbcfcc8 "") at pquery.c:179
#23 0x08203251 in PortalRunMulti (portal=0x86eb0b0, isTopLevel=<value optimized out>, dest=0x8ab81a0, altdest=0x83bf95c, completionTag=0xbfbcfcc8 "") at pquery.c:1242
#24 0x0820345d in FillPortalStore (portal=0x86eb0b0, isTopLevel=1 '\001') at pquery.c:1045
#25 0x08203a31 in PortalRun (portal=0x86eb0b0, count=2147483647, isTopLevel=1 '\001', dest=0x86a16d0, altdest=0x86a16d0, completionTag=0xbfbcffd8 "") at pquery.c:792
#26 0x08200154 in exec_execute_message (argc=4, argv=<value optimized out>, username=0x8644890 "foo") at postgres.c:1942
#27 PostgresMain (argc=4, argv=<value optimized out>, username=0x8644890 "foo") at postgres.c:3688
#28 0x081d4228 in BackendStartup () at postmaster.c:3207
#29 ServerLoop () at postmaster.c:1274
#30 0x081d5105 in PostmasterMain (argc=5, argv=0x8641828) at postmaster.c:1029
#31 0x0818be60 in main (argc=5, argv=0x4489ffff) at main.c:188

--
GJ
Looking at the system a bit more now, it looks like the 'waiting' states are changing for both the query and the autovacuum in pg_stat_activity - but very slowly. It looks like they have both got into the sort of state that keeps them on the edge of starvation.

So this isn't really a deadlock, but rather very poor performance in this corner case.

This is a test system, so I can harvest some more data from it. I've already got core files for the autovacuum and query connections. I'm hoping for more suggestions on how to gather more intel that could be useful to you guys. Unfortunately my knowledge of the locking mechanisms in PostgreSQL is rather poor, beyond the basics.
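One low-effort way to gather that kind of intel is to sample pg_stat_activity every few seconds and keep the output. A minimal sketch, assuming the 8.3 column names (procpid, waiting, current_query) and that autovacuum workers report a current_query beginning with 'autovacuum:':

-- Snapshot of everything that is waiting, plus the autovacuum workers
SELECT now() AS sampled_at,
       procpid,
       waiting,
       xact_start,
       current_query
FROM pg_stat_activity
WHERE waiting
   OR current_query LIKE 'autovacuum:%'
ORDER BY xact_start;

Run from a shell loop or cron, this gives a rough timeline of how the waiters and the vacuum worker trade places.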
Excerpts from Gregg Jaskiewicz's message of vie mar 02 08:22:22 -0300 2012:

> Looking at the system a bit more now, it looks like the 'waiting'
> states are changing for both the query and the autovacuum in
> pg_stat_activity - but very slowly. It looks like they have both got
> into the sort of state that keeps them on the edge of starvation.

Ouch.

> So this isn't really a deadlock, but rather very poor performance in
> this corner case.

Right. I think I can explain how this locking works: autovacuum needs a
"cleanup" lock on the page being processed, which is a special exclusive
lock that also requires that no one is holding a "pin" on the buffer.
Any process running a query holds a pin on the buffer while inspecting
tuples on it; when it's done with the tuples on that page it should move
on to the next page in the table -- same as autovacuum. So what seems to
be happening here is that the autovacuum and the scan are in sync walking
the table, stepping on each other's toes. What I don't know is why they
are so closely in sync.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
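Buffer pins aren't exposed in any system view in 8.3, so they can't be inspected directly; a rough proxy is to look for long-running statements touching the same tables the vacuum worker is processing. A minimal sketch, with my_table standing in for one of the affected tables (the name is hypothetical):

-- Long-running, non-idle statements that mention the affected table
SELECT procpid, usename, query_start,
       now() - query_start AS running_for,
       current_query
FROM pg_stat_activity
WHERE current_query LIKE '%my_table%'
  AND current_query NOT LIKE '<IDLE>%'
ORDER BY query_start;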
On Fri, Mar 2, 2012 at 6:22 AM, Gregg Jaskiewicz <gryzman@gmail.com> wrote:
> Looking at the system a bit more now, it looks like the 'waiting'
> states are changing for both the query and the autovacuum in
> pg_stat_activity - but very slowly. It looks like they have both got
> into the sort of state that keeps them on the edge of starvation.
>
> So this isn't really a deadlock, but rather very poor performance in
> this corner case.

Are you making any use of cursors?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
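The question matters, presumably, because a cursor that has fetched partway through a table keeps a pin on the buffer it is positioned on for as long as it stays open, which is exactly what LockBufferForCleanup has to wait out. A hypothetical illustration (table and cursor names made up):

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM some_table;
FETCH 10 FROM c;
-- The cursor is now parked on a page of some_table and holds a pin on that
-- buffer; a VACUUM needing a cleanup lock on the page waits until the cursor
-- moves on, is closed, or the transaction ends.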
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Right. I think I can explain how this locking works: autovacuum needs a
> "cleanup" lock on the page being processed, which is a special exclusive
> lock that also requires that no one is holding a "pin" on the buffer.
> Any process running a query holds a pin on the buffer while inspecting
> tuples on it; when it's done with the tuples on that page it should move
> on to the next page in the table -- same as autovacuum. So what seems to
> be happening here is that the autovacuum and the scan are in sync walking
> the table, stepping on each other's toes.

I don't believe that. The trace shows the other process is waiting for
a tuple lock, which is not something that autovacuum would take. Given
the reference to prepared transactions, what seems likely is that the
UPDATE command is blocked by a prepared transaction (IOW, one that
already updated the same tuple) and VACUUM is stuck behind the UPDATE.

So the real problem is slow removal of prepared transactions, which
most likely is an application logic problem. It's certainly not
autovac's fault.

			regards, tom lane
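If that theory is right, the blocker should be visible in pg_locks: a prepared transaction shows up there with a NULL pid and a virtualtransaction of the form '-1/xid'. A minimal sketch of a query pairing ungranted lock requests with whoever holds the matching granted lock, not taken from the thread:

-- Ungranted lock requests and the holders of the matching granted locks;
-- a prepared transaction appears with holding_pid IS NULL.
SELECT w.locktype,
       w.relation::regclass AS rel,
       w.page, w.tuple, w.transactionid,
       w.pid AS waiting_pid,
       h.pid AS holding_pid,
       h.virtualtransaction AS holding_vxid,
       p.gid, p.prepared
FROM pg_locks w
JOIN pg_locks h
  ON h.granted
 AND h.locktype = w.locktype
 AND h.database      IS NOT DISTINCT FROM w.database
 AND h.relation      IS NOT DISTINCT FROM w.relation
 AND h.page          IS NOT DISTINCT FROM w.page
 AND h.tuple         IS NOT DISTINCT FROM w.tuple
 AND h.transactionid IS NOT DISTINCT FROM w.transactionid
LEFT JOIN pg_prepared_xacts p ON p.transaction = h.transactionid
WHERE NOT w.granted;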
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> So the real problem is slow removal of prepared transactions,
> which most likely is an application logic problem. It's certainly
> not autovac's fault.

Yeah, I've seen way too much Java code lately which fails to close
ResultSet or Statement (which includes PreparedStatement) objects,
leaving it to the vagaries of the Java garbage collector to close these
objects right before freeing their memory to the Java heap. People
should never count on the finalize() method to do this for them -- the
close() method should be invoked as soon as the object is no longer
needed, and definitely before it goes out of scope.

-Kevin
On 2 Mar 2012, at 15:28, Robert Haas wrote:

> On Fri, Mar 2, 2012 at 6:22 AM, Gregg Jaskiewicz <gryzman@gmail.com> wrote:
>> Looking at the system a bit more now, it looks like the 'waiting'
>> states are changing for both the query and the autovacuum in
>> pg_stat_activity - but very slowly. It looks like they have both got
>> into the sort of state that keeps them on the edge of starvation.
>>
>> So this isn't really a deadlock, but rather very poor performance in
>> this corner case.
>
> Are you making any use of cursors?

Nope.
Tom, agreed - it looks like we dug the hole and got ourselves into it. But I still want to understand why.

It looks like we have a rather small table on the host where I see the slowness, and all the other tables have triggers that update one row in that small table. The small table contains a single entry per table.

The thing is, when I scan pg_locks, I can pretty much see everything waiting for a lock on that table. To grab the pg_locks output, I'm using this view:

SELECT
    waiting.locktype           AS waiting_locktype,
    waiting.relation::regclass AS waiting_table,
    waiting_stm.current_query  AS waiting_query,
    waiting.mode               AS waiting_mode,
    waiting.pid                AS waiting_pid,
    other.locktype             AS other_locktype,
    other.relation::regclass   AS other_table,
    other_stm.current_query    AS other_query,
    other.mode                 AS other_mode,
    other.pid                  AS other_pid,
    other.granted              AS other_granted
FROM pg_catalog.pg_locks AS waiting
JOIN pg_catalog.pg_stat_activity AS waiting_stm
    ON ( waiting_stm.procpid = waiting.pid )
JOIN pg_catalog.pg_locks AS other
    ON ( ( waiting."database" = other."database"
           AND waiting.relation = other.relation )
         OR waiting.transactionid = other.transactionid )
JOIN pg_catalog.pg_stat_activity AS other_stm
    ON ( other_stm.procpid = other.pid )
WHERE NOT waiting.granted
  AND waiting.pid <> other.pid
  AND other_stm.query_start < now() - interval '14 hours'
  AND other_stm.current_query NOT LIKE '<IDLE>';

And yes, some updates have been there for longer than 14 hours. Two of those queries in particular - both updating just a single row - have been stuck for over 14 hours (2 days now, actually). I simply cannot believe that a single table in the middle of things would lock stuff up so much.

Also, on the subject of prepared transactions (2PC): "select * from pg_prepared_xacts;" simply does not reveal anything, despite the fact that I know there should be at least two of those open. Unless it only lists transactions that have already been PREPAREd, not ones still in the middle of their work.

I need these 2PC transactions in order to achieve something close to multi-master replication. But what I think I'll target first is the triggers updating that single table on my 'main master' - unless you guys can suggest something better.
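A hypothetical illustration of the trigger pattern described above - all names made up, not from the thread - just to make the contention point concrete: every writing transaction ends up updating the same row of the small bookkeeping table, so they all queue on that one row lock, and a prepared-but-uncommitted transaction that touched it keeps everyone else waiting:

-- Small bookkeeping table with one row per application table (hypothetical)
CREATE TABLE table_stats (
    table_name    text PRIMARY KEY,
    last_modified timestamptz
);

-- One of the application tables carrying the trigger (hypothetical)
CREATE TABLE some_table (
    id      integer PRIMARY KEY,
    payload text
);

CREATE OR REPLACE FUNCTION touch_table_stats() RETURNS trigger AS $$
BEGIN
    -- Every update on the parent table funnels through this single row.
    UPDATE table_stats
       SET last_modified = now()
     WHERE table_name = TG_TABLE_NAME;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER some_table_touch
    BEFORE UPDATE ON some_table
    FOR EACH ROW EXECUTE PROCEDURE touch_table_stats();

The backtrace earlier in the thread (heap_update reached from a plpgsql trigger via ExecBRUpdateTriggers) is consistent with this kind of pattern.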
Gregg Jaskiewicz wrote:

> Also, on the subject of prepared transactions (2PC): "select * from
> pg_prepared_xacts;" simply does not reveal anything, despite the fact
> that I know there should be at least two of those open.
> Unless it only lists transactions that have already been PREPAREd, not
> ones still in the middle of their work.

Transactions only appear there between the PREPARE TRANSACTION statement
and the COMMIT PREPARED (or ROLLBACK PREPARED). Hopefully that is a small
window of time for each transaction.

-Kevin