Thread: auto vacuum lock on 8.1beta1
I have a vacuum process kicked of by autovacuum that appears hung and causing general grief. When I put too many queries at the db in this state, the Context Switches cruises up to ~90k and stay there. Queries that normally take < 1 second are up to over a minute. The autovacuum thread has been sitting on the db for 24 hours now. Might this be the same problem as the recent thread "database vacuum from cron hanging" where Tom is: "I'm busy volatile-izing all the code in bufmgr.c ... should be able to commit a fix soon."? I can provide more information as requested. postgres 30463 1 0 Oct11 pts/0 00:00:02 /usr/local/pgsql810/bin/postmaster -D /var/lib/pgsql/data810 postgres 30466 30463 0 Oct11 pts/0 00:02:04 postgres: writer process postgres 30467 30463 0 Oct11 pts/0 00:02:07 postgres: stats buffer process postgres 30469 30467 0 Oct11 pts/0 00:03:48 postgres: stats collector process postgres 21119 30463 0 Oct12 pts/0 00:00:00 postgres: integration_test int_test_new 129.80.80.16(49465) idle in transaction postgres 21120 30463 0 Oct12 pts/0 00:00:00 postgres: integration_test int_test_new 129.80.80.16(49466) idle in transaction postgres 21121 30463 0 Oct12 pts/0 00:00:01 postgres: integration_test int_test_new 129.80.80.16(49467) idle in transaction postgres 21157 30463 0 Oct12 pts/0 00:00:00 postgres: integration_test int_test_new 129.80.80.16(49885) idle in transaction postgres 23592 30463 0 Oct12 pts/0 00:00:00 postgres: integration_test int_test_new 129.80.17.25(53872) idle in transaction postgres 23593 30463 0 Oct12 pts/0 00:00:00 postgres: integration_test int_test_new 129.80.17.25(53873) idle in transaction postgres 23594 30463 1 Oct12 pts/0 00:12:46 postgres: integration_test int_test_new 129.80.17.25(53874) idle in transaction postgres 7041 30463 0 09:34 pts/0 00:00:00 postgres: creagrs int_test_new 129.80.80.164(49930) idle postgres 21064 30463 0 Oct12 pts/0 00:00:00 postgres: autovacuum process int_test_new postgres 8138 30463 2 11:05 pts/0 00:01:39 postgres: integration_test int_test_new 129.80.17.25(54366) idle in transaction postgres 8139 30463 4 11:05 pts/0 00:03:16 postgres: integration_test int_test_new 129.80.17.25(54367) idle in transaction postgres 8140 30463 4 11:05 pts/0 00:03:18 postgres: integration_test int_test_new 129.80.17.25(54368) idle in transaction postgres 8141 30463 4 11:05 pts/0 00:03:55 postgres: integration_test int_test_new 129.80.17.25(54369) idle in transaction postgres 8142 30463 4 11:05 pts/0 00:03:28 postgres: integration_test int_test_new 129.80.17.25(54370) idle in transaction postgres 8143 30463 4 11:05 pts/0 00:03:24 postgres: integration_test int_test_new 129.80.17.25(54371) idle in transaction postgres 8144 30463 4 11:05 pts/0 00:03:56 postgres: integration_test int_test_new 129.80.17.25(54372) idle in transaction postgres 8145 30463 2 11:05 pts/0 00:01:48 postgres: integration_test int_test_new 129.80.17.25(54373) idle in transaction postgres 8146 30463 4 11:05 pts/0 00:03:13 postgres: integration_test int_test_new 129.80.17.25(54374) idle in transaction postgres 8147 30463 4 11:05 pts/0 00:03:21 postgres: integration_test int_test_new 129.80.17.25(54375) idle in transaction postgres 8148 30463 4 11:05 pts/0 00:03:29 postgres: integration_test int_test_new 129.80.17.25(54376) idle in transaction postgres 8149 30463 2 11:05 pts/0 00:01:37 postgres: integration_test int_test_new 129.80.17.25(54377) idle in transaction postgres 8150 30463 4 11:05 pts/0 00:03:30 postgres: integration_test int_test_new 129.80.17.25(54378) idle in transaction postgres 8151 30463 5 11:05 pts/0 00:04:16 postgres: integration_test int_test_new 129.80.17.25(54379) idle in transaction postgres 8152 30463 1 11:05 pts/0 00:01:35 postgres: integration_test int_test_new 129.80.17.25(54380) idle in transaction postgres 8153 30463 4 11:05 pts/0 00:03:26 postgres: integration_test int_test_new 129.80.17.25(54381) idle in transaction postgres 9114 30463 0 12:24 pts/0 00:00:00 postgres: creagrs creagrs 129.80.80.164(50863) idle Process 21064 is not moving, and other manual vacuum attempt block at some point. (gdb) bt #0 0x401f1e78 in semop () from /lib/i686/libc.so.6 #1 0x0834ba38 in ?? () #2 0x0816e7d4 in PGSemaphoreLock () #3 0x08192936 in ProcWaitForSignal () #4 0x08187bce in LockBufferForCleanup () #5 0x0811c37c in lazy_vacuum_heap () #6 0x0811bcb0 in lazy_scan_heap () #7 0x0811baad in lazy_vacuum_rel () #8 0x081180b9 in vacuum_rel () #9 0x0811763b in vacuum () #10 0x081703fd in autovacuum_do_vac_analyze () #11 0x08170017 in do_autovacuum () #12 0x0816fb4a in AutoVacMain () #13 0x0816f831 in autovac_start () #14 0x08175202 in ServerLoop () #15 0x0817441a in PostmasterMain () #16 0x0813be94 in main () #17 0x4012bc57 in __libc_start_main () from /lib/i686/libc.so.6 The last log entries (for autovacuum) are (now is Thu Oct 13 12:32:18 MDT 2005): Oct 12 13:48:37 annette postgres810[20936]: [2-1] LOG: autovacuum: processing database "int_test_new" Oct 12 13:49:37 annette postgres810[20947]: [2-1] LOG: autovacuum: processing database "creagrs" Oct 12 13:50:37 annette postgres810[20957]: [2-1] LOG: autovacuum: processing database "postgres" Oct 12 13:51:37 annette postgres810[20967]: [2-1] LOG: autovacuum: processing database "unit_test" Oct 12 13:52:37 annette postgres810[20977]: [2-1] LOG: autovacuum: processing database "template1" Oct 12 13:54:11 annette postgres810[21000]: [2-1] LOG: autovacuum: processing database "int_test_new" Oct 12 13:55:16 annette postgres810[21016]: [2-1] LOG: autovacuum: processing database "creagrs" Oct 12 13:56:16 annette postgres810[21033]: [2-1] LOG: autovacuum: processing database "postgres" Oct 12 13:57:16 annette postgres810[21043]: [2-1] LOG: autovacuum: processing database "unit_test" Oct 12 13:58:16 annette postgres810[21053]: [2-1] LOG: autovacuum: processing database "template1" Oct 12 13:59:16 annette postgres810[21064]: [2-1] LOG: autovacuum: processing database "int_test_new" creagrs=# select version(); version ------------------------------------------------------------------------------- ----------------------------PostgreSQL 8.1beta2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1 (Mandrake Linux 9.2 3.3.1-2mdk) -- Robert Creager Advisory Software Engineer Data Management Group Sun Microsystems Robert.Creager@Sun.com 303.673.2365 Office 888.912.4458 Pager
Robert Creager <Robert.Creager@Sun.com> writes: > Might this be the same problem as the recent thread "database vacuum from cron > hanging" where Tom is: "I'm busy volatile-izing all the code in bufmgr.c ... > should be able to commit a fix soon."? Seems reasonably likely, seeing that the original report involved gcc 3.3.something IIRC, and you're using 3.3.1. Is this an SMP box? The bug could theoretically manifest on a uniprocessor but it seems more likely to happen on a multiprocessor. Too bad you didn't have it built with --enable-debug; I can't think of any very easy way to verify a negative refcount for that buffer without gdb support. You could try inspecting the assembly code generated for PinBuffer, as we did with Kevin's compiler. If it's generating the same code sequence then that would make it pretty likely that you're seeing the same thing. The volatile patch should be available in last night's nightly snapshot, if you just want to update. regards, tom lane
I can confirm that the patch was in the snapshot I picked up this morning at about 10:30 CDT. We've been using it since then and have not seen the problem in spite of attempting to provoke it with database vacuums. -Kevin >>> Tom Lane <tgl@sss.pgh.pa.us> 10/13/05 2:09 PM >>> Robert Creager <Robert.Creager@Sun.com> writes: > Might this be the same problem as the recent thread "database vacuum from cron > hanging" where Tom is: "I'm busy volatile-izing all the code in bufmgr.c ... > should be able to commit a fix soon."? Seems reasonably likely, seeing that the original report involved gcc 3.3.something IIRC, and you're using 3.3.1. Is this an SMP box? The bug could theoretically manifest on a uniprocessor but it seems more likely to happen on a multiprocessor. Too bad you didn't have it built with --enable-debug; I can't think of any very easy way to verify a negative refcount for that buffer without gdb support. You could try inspecting the assembly code generated for PinBuffer, as we did with Kevin's compiler. If it's generating the same code sequence then that would make it pretty likely that you're seeing the same thing. The volatile patch should be available in last night's nightly snapshot, if you just want to update. regards, tom lane
On Thu, 13 Oct 2005 15:09:58 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Creager <Robert.Creager@Sun.com> writes: > > Might this be the same problem as the recent thread "database vacuum from > > cron hanging" where Tom is: "I'm busy volatile-izing all the code in > > bufmgr.c ... should be able to commit a fix soon."? > > Seems reasonably likely, seeing that the original report involved gcc > 3.3.something IIRC, and you're using 3.3.1. Is this an SMP box? The > bug could theoretically manifest on a uniprocessor but it seems more > likely to happen on a multiprocessor. Yes, it's a dual Xenon box. > > Too bad you didn't have it built with --enable-debug; I can't think of > any very easy way to verify a negative refcount for that buffer without > gdb support. I just downloaded last nights snapshot. I'll build with debug and try it out. Thanks, Rob
On Thu, 13 Oct 2005 14:20:46 -0500 Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > I can confirm that the patch was in the snapshot I picked up this > morning at about 10:30 CDT. We've been using it since then and > have not seen the problem in spite of attempting to provoke it with > database vacuums. > Thanks Kevin. I'm giving it a try now.