Thread: auto vacuum lock on 8.1beta1

auto vacuum lock on 8.1beta1

From
Robert Creager
Date:
I have a vacuum process kicked of by autovacuum that appears hung and causing
general grief.  When I put too many queries at the db in this state, the Context
Switches cruises up to ~90k and stay there.  Queries that normally take < 1
second are up to over a minute.  The autovacuum thread has been sitting on the
db for 24 hours now.

Might this be the same problem as the recent thread "database vacuum from cron
hanging" where Tom is: "I'm busy volatile-izing all the code in bufmgr.c ...
should be able to commit a fix soon."?  I can provide more information as
requested.

postgres 30463     1  0 Oct11 pts/0    00:00:02
/usr/local/pgsql810/bin/postmaster -D /var/lib/pgsql/data810
postgres 30466 30463  0 Oct11 pts/0    00:02:04 postgres: writer process
postgres 30467 30463  0 Oct11 pts/0    00:02:07 postgres: stats buffer process
postgres 30469 30467  0 Oct11 pts/0    00:03:48 postgres: stats collector
process
postgres 21119 30463  0 Oct12 pts/0    00:00:00 postgres: integration_test
int_test_new 129.80.80.16(49465) idle in transaction
postgres 21120 30463  0 Oct12 pts/0    00:00:00 postgres: integration_test
int_test_new 129.80.80.16(49466) idle in transaction
postgres 21121 30463  0 Oct12 pts/0    00:00:01 postgres: integration_test
int_test_new 129.80.80.16(49467) idle in transaction
postgres 21157 30463  0 Oct12 pts/0    00:00:00 postgres: integration_test
int_test_new 129.80.80.16(49885) idle in transaction
postgres 23592 30463  0 Oct12 pts/0    00:00:00 postgres: integration_test
int_test_new 129.80.17.25(53872) idle in transaction
postgres 23593 30463  0 Oct12 pts/0    00:00:00 postgres: integration_test
int_test_new 129.80.17.25(53873) idle in transaction
postgres 23594 30463  1 Oct12 pts/0    00:12:46 postgres: integration_test
int_test_new 129.80.17.25(53874) idle in transaction
postgres  7041 30463  0 09:34 pts/0    00:00:00 postgres: creagrs int_test_new
129.80.80.164(49930) idle
postgres 21064 30463  0 Oct12 pts/0    00:00:00 postgres: autovacuum process  
int_test_new
postgres  8138 30463  2 11:05 pts/0    00:01:39 postgres: integration_test
int_test_new 129.80.17.25(54366) idle in transaction
postgres  8139 30463  4 11:05 pts/0    00:03:16 postgres: integration_test
int_test_new 129.80.17.25(54367) idle in transaction
postgres  8140 30463  4 11:05 pts/0    00:03:18 postgres: integration_test
int_test_new 129.80.17.25(54368) idle in transaction
postgres  8141 30463  4 11:05 pts/0    00:03:55 postgres: integration_test
int_test_new 129.80.17.25(54369) idle in transaction
postgres  8142 30463  4 11:05 pts/0    00:03:28 postgres: integration_test
int_test_new 129.80.17.25(54370) idle in transaction
postgres  8143 30463  4 11:05 pts/0    00:03:24 postgres: integration_test
int_test_new 129.80.17.25(54371) idle in transaction
postgres  8144 30463  4 11:05 pts/0    00:03:56 postgres: integration_test
int_test_new 129.80.17.25(54372) idle in transaction
postgres  8145 30463  2 11:05 pts/0    00:01:48 postgres: integration_test
int_test_new 129.80.17.25(54373) idle in transaction
postgres  8146 30463  4 11:05 pts/0    00:03:13 postgres: integration_test
int_test_new 129.80.17.25(54374) idle in transaction
postgres  8147 30463  4 11:05 pts/0    00:03:21 postgres: integration_test
int_test_new 129.80.17.25(54375) idle in transaction
postgres  8148 30463  4 11:05 pts/0    00:03:29 postgres: integration_test
int_test_new 129.80.17.25(54376) idle in transaction
postgres  8149 30463  2 11:05 pts/0    00:01:37 postgres: integration_test
int_test_new 129.80.17.25(54377) idle in transaction
postgres  8150 30463  4 11:05 pts/0    00:03:30 postgres: integration_test
int_test_new 129.80.17.25(54378) idle in transaction
postgres  8151 30463  5 11:05 pts/0    00:04:16 postgres: integration_test
int_test_new 129.80.17.25(54379) idle in transaction
postgres  8152 30463  1 11:05 pts/0    00:01:35 postgres: integration_test
int_test_new 129.80.17.25(54380) idle in transaction
postgres  8153 30463  4 11:05 pts/0    00:03:26 postgres: integration_test
int_test_new 129.80.17.25(54381) idle in transaction
postgres  9114 30463  0 12:24 pts/0    00:00:00 postgres: creagrs creagrs
129.80.80.164(50863) idle

Process 21064 is not moving, and other manual vacuum attempt block at some
point.

(gdb) bt
#0  0x401f1e78 in semop () from /lib/i686/libc.so.6
#1  0x0834ba38 in ?? ()
#2  0x0816e7d4 in PGSemaphoreLock ()
#3  0x08192936 in ProcWaitForSignal ()
#4  0x08187bce in LockBufferForCleanup ()
#5  0x0811c37c in lazy_vacuum_heap ()
#6  0x0811bcb0 in lazy_scan_heap ()
#7  0x0811baad in lazy_vacuum_rel ()
#8  0x081180b9 in vacuum_rel ()
#9  0x0811763b in vacuum ()
#10 0x081703fd in autovacuum_do_vac_analyze ()
#11 0x08170017 in do_autovacuum ()
#12 0x0816fb4a in AutoVacMain ()
#13 0x0816f831 in autovac_start ()
#14 0x08175202 in ServerLoop ()
#15 0x0817441a in PostmasterMain ()
#16 0x0813be94 in main ()
#17 0x4012bc57 in __libc_start_main () from /lib/i686/libc.so.6

The last log entries (for autovacuum) are (now is Thu Oct 13 12:32:18 MDT 2005):
Oct 12 13:48:37 annette postgres810[20936]: [2-1] LOG:  autovacuum: processing
database "int_test_new"
Oct 12 13:49:37 annette postgres810[20947]: [2-1] LOG:  autovacuum: processing
database "creagrs"
Oct 12 13:50:37 annette postgres810[20957]: [2-1] LOG:  autovacuum: processing
database "postgres"
Oct 12 13:51:37 annette postgres810[20967]: [2-1] LOG:  autovacuum: processing
database "unit_test"
Oct 12 13:52:37 annette postgres810[20977]: [2-1] LOG:  autovacuum: processing
database "template1"
Oct 12 13:54:11 annette postgres810[21000]: [2-1] LOG:  autovacuum: processing
database "int_test_new"
Oct 12 13:55:16 annette postgres810[21016]: [2-1] LOG:  autovacuum: processing
database "creagrs"
Oct 12 13:56:16 annette postgres810[21033]: [2-1] LOG:  autovacuum: processing
database "postgres"
Oct 12 13:57:16 annette postgres810[21043]: [2-1] LOG:  autovacuum: processing
database "unit_test"
Oct 12 13:58:16 annette postgres810[21053]: [2-1] LOG:  autovacuum: processing
database "template1"
Oct 12 13:59:16 annette postgres810[21064]: [2-1] LOG:  autovacuum: processing
database "int_test_new"


creagrs=# select version();                                                 version                       
-------------------------------------------------------------------------------
----------------------------PostgreSQL 8.1beta2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1
(Mandrake Linux 9.2 3.3.1-2mdk)


--
Robert Creager
Advisory Software Engineer
Data Management Group
Sun Microsystems
Robert.Creager@Sun.com
303.673.2365 Office
888.912.4458 Pager


Re: auto vacuum lock on 8.1beta1

From
Tom Lane
Date:
Robert Creager <Robert.Creager@Sun.com> writes:
> Might this be the same problem as the recent thread "database vacuum from cron
> hanging" where Tom is: "I'm busy volatile-izing all the code in bufmgr.c ...
> should be able to commit a fix soon."?

Seems reasonably likely, seeing that the original report involved gcc
3.3.something IIRC, and you're using 3.3.1.  Is this an SMP box?  The
bug could theoretically manifest on a uniprocessor but it seems more
likely to happen on a multiprocessor.

Too bad you didn't have it built with --enable-debug; I can't think of
any very easy way to verify a negative refcount for that buffer without
gdb support.

You could try inspecting the assembly code generated for PinBuffer, as
we did with Kevin's compiler.  If it's generating the same code sequence
then that would make it pretty likely that you're seeing the same thing.

The volatile patch should be available in last night's nightly snapshot,
if you just want to update.
        regards, tom lane


Re: auto vacuum lock on 8.1beta1

From
"Kevin Grittner"
Date:
I can confirm that the patch was in the snapshot I picked up this
morning at about 10:30 CDT.  We've been using it since then and
have not seen the problem in spite of attempting to provoke it with
database vacuums.
-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 10/13/05 2:09 PM >>>
Robert Creager <Robert.Creager@Sun.com> writes:
> Might this be the same problem as the recent thread "database vacuum from cron
> hanging" where Tom is: "I'm busy volatile-izing all the code in bufmgr.c ...
> should be able to commit a fix soon."?

Seems reasonably likely, seeing that the original report involved gcc
3.3.something IIRC, and you're using 3.3.1.  Is this an SMP box?  The
bug could theoretically manifest on a uniprocessor but it seems more
likely to happen on a multiprocessor.

Too bad you didn't have it built with --enable-debug; I can't think of
any very easy way to verify a negative refcount for that buffer without
gdb support.

You could try inspecting the assembly code generated for PinBuffer, as
we did with Kevin's compiler.  If it's generating the same code sequence
then that would make it pretty likely that you're seeing the same thing.

The volatile patch should be available in last night's nightly snapshot,
if you just want to update.
        regards, tom lane



Re: auto vacuum lock on 8.1beta1

From
Robert Creager
Date:
On Thu, 13 Oct 2005 15:09:58 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Creager <Robert.Creager@Sun.com> writes:
> > Might this be the same problem as the recent thread "database vacuum from
> > cron hanging" where Tom is: "I'm busy volatile-izing all the code in
> > bufmgr.c ... should be able to commit a fix soon."?
> 
> Seems reasonably likely, seeing that the original report involved gcc
> 3.3.something IIRC, and you're using 3.3.1.  Is this an SMP box?  The
> bug could theoretically manifest on a uniprocessor but it seems more
> likely to happen on a multiprocessor.

Yes, it's a dual Xenon box.

> 
> Too bad you didn't have it built with --enable-debug; I can't think of
> any very easy way to verify a negative refcount for that buffer without
> gdb support.

I just downloaded last nights snapshot.  I'll build with debug and try it out.

Thanks,
Rob


Re: auto vacuum lock on 8.1beta1

From
Robert Creager
Date:
On Thu, 13 Oct 2005 14:20:46 -0500
Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> I can confirm that the patch was in the snapshot I picked up this
> morning at about 10:30 CDT.  We've been using it since then and
> have not seen the problem in spite of attempting to provoke it with
> database vacuums.
>  

Thanks Kevin.  I'm giving it a try now.