Thread: Help: 8.0.3 Vacuum of an empty table never completes ...
G'day folks. We have a production database running 8.0.3 which gets fully pg_dump'd and vacuum analyze'd hourly by cron. Something strange happened to us on the 5 AM Friday Nov. 25th cron run -- this step in our cron procedure never completed:

    /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose >& $DATE/vacuum.log

Strange, since no known event of note happened on Friday; we were all out of the office after Wednesday for the American Thanksgiving holiday.

Anyway, running the vacuum line by hand shows it getting stuck -- it processes the majority of our tables, then just stops, and the backend postmaster stops accumulating CPU time. Comparing the logs against a run that did complete, one table in particular (at least) seems afflicted:

    social=# vacuum verbose analyze agency.swlog_client;

hangs up forever -- we have to control-C the client. Likewise without 'analyze'. pg_dump'ing the entire database works (phew!), and after restoring onto a backup box, said table can be vacuumed:

    social=# vacuum verbose analyze agency.swlog_client;
    INFO:  vacuuming "agency.swlog_client"
    INFO:  index "swlog_client_pkey" now contains 0 row versions in 1 pages
    DETAIL:  0 index pages have been deleted, 0 are currently reusable.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
    INFO:  "swlog_client": found 0 removable, 0 nonremovable row versions in 0 pages
    DETAIL:  0 dead row versions cannot be removed yet.
    There were 0 unused item pointers.
    0 pages are entirely empty.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
    INFO:  analyzing "agency.swlog_client"
    INFO:  "swlog_client": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
    VACUUM

That's right -- a completely empty table, which is what we actually expect. How should we proceed so that we can learn from this and also get our entire database vacuuming successfully again?

We are running on Linux 2.6.8-24.18-smp (SuSE 9.2). No juicy filesystem-related messages in dmesg nor /var/log/messages. 11% disk used on the postgres-related partition.

The table in question is defined as:

    social=# \d agency.swlog_client;
         Table "agency.swlog_client"
     Column |  Type  | Modifiers
    --------+--------+-----------
     swlog  | bigint | not null
     client | bigint | not null
    Indexes:
        "swlog_client_pkey" PRIMARY KEY, btree (swlog, client)
    Foreign-key constraints:
        "$2" FOREIGN KEY (client) REFERENCES agency.client(id)
        "$1" FOREIGN KEY (swlog) REFERENCES agency.swlog(id)

And the two fk'd tables:

    social=# select count(*) from agency.client;
     count
    -------
         0
    (1 row)

    social=# select count(*) from agency.swlog;
     count
    -------
        69
    (1 row)

We doubt there is any oddball, extremely long-running transaction in any of those related tables gumming up this table.

Finally, the only potentially interesting database-wise event happened on Wed. Nov. 23rd -- we SIGHUP'd the postmaster to have it learn a higher value for work_mem (10240, up from the default of 1024). But the hourly crons went fine for the subsequent two days. maintenance_work_mem is still at the default of 16384.

Many thanks in advance!
James

----
James Robinson
Socialserve.com
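For the "oddball long-running transaction" worry above, one quick check is pg_stat_activity. A minimal sketch, assuming the 8.0-era column names (procpid, query_start, current_query) and that stats_command_string is enabled so current_query is populated:

    # List backends and what they are currently running, oldest query first.
    psql -U postgres -d social -c "
        SELECT procpid, usename, query_start, current_query
        FROM pg_stat_activity
        ORDER BY query_start;"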
James Robinson <jlrobins@socialserve.com> writes:
> Comparing the logs against a run that did complete, one table in
> particular (at least) seems afflicted:

> social=# vacuum verbose analyze agency.swlog_client;

> hangs up forever -- we have to control-C the client. Likewise without
> 'analyze'.

Given that it's not eating CPU time, one would guess that it's blocked
waiting for a lock.  Can you find any relevant locks in pg_locks?

			regards, tom lane
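A minimal sketch of the sort of pg_locks check being suggested here (not the exact query used later in the thread), joined to pg_class so waiting relations come back by name rather than OID:

    # Show ungranted lock requests and the relations they are waiting on.
    psql -U postgres -d social -c "
        SELECT l.pid, c.relname, l.mode, l.granted
        FROM pg_locks l
        LEFT JOIN pg_class c ON c.oid = l.relation
        WHERE NOT l.granted;"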
As fate would have it, the vacuumdb frontend and backend which were initially afflicted are still in existence:

    sscadmin 19236 19235  0 Nov25 ?  00:00:00 /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose
    postgres 19244  3596  0 Nov25 ?  00:00:02 postgres: postgres social [local] VACUUM

The backend is pid 19244. And here's pg_locks:

    social=# select * from pg_locks;
     relation | database | transaction |  pid  |           mode           | granted
    ----------+----------+-------------+-------+--------------------------+---------
              |          |    38790657 | 19244 | ExclusiveLock            | t
      6586066 |  6585892 |             | 28406 | ShareUpdateExclusiveLock | f
              |          |    39097312 | 28861 | ExclusiveLock            | t
              |          |    39089744 | 28756 | ExclusiveLock            | t
      6586066 |  6585892 |             | 28756 | ShareUpdateExclusiveLock | f
      6586066 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
      6586066 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
      8417138 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
        16839 |  6585892 |             | 28861 | AccessShareLock          | t
              |          |    39063661 | 28560 | ExclusiveLock            | t
              |          |    39056736 | 28406 | ExclusiveLock            | t
      6586066 |  6585892 |             | 28560 | ShareUpdateExclusiveLock | f
    (12 rows)

      pid
    -------
     19244
     28406
     28560
     28756
     28861
    (5 rows)

Of those 5 pids:

    19244 -- the vacuum backend initially afflicted. Status in argv: 'postgres: postgres social [local] VACUUM'
    28406 -- a 10 AM vacuum started up by cron this morning, after I got things halfway working again early in the diagnosis of this situation. Args: 'postgres: postgres social [local] VACUUM waiting'
    28560 -- a 10:16 AM by-hand vacuum session, further into diagnosis land. Args: 'postgres: postgres social [local] VACUUM waiting'
    28756 -- the 11 AM cron'd process. Yes, I see a quickly mounting issue here. Args: 'postgres: postgres social [local] VACUUM waiting'
    28861 -- a production-servicing backend, now back in the idle state [ not "idle in transaction", just regular idle ].

On Nov 28, 2005, at 11:09 AM, Tom Lane wrote:
> James Robinson <jlrobins@socialserve.com> writes:
>> Comparing the logs against a run that did complete, one table in
>> particular (at least) seems afflicted:

>> social=# vacuum verbose analyze agency.swlog_client;

>> hangs up forever -- we have to control-C the client. Likewise without
>> 'analyze'.
>
> Given that it's not eating CPU time, one would guess that it's blocked
> waiting for a lock.  Can you find any relevant locks in pg_locks?
>
> 			regards, tom lane

----
James Robinson
Socialserve.com
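A possible follow-up on the output above (a sketch, not from the original mail): relation 6586066 is the one collecting the ungranted ShareUpdateExclusiveLock requests, so listing holders versus waiters on just that OID makes the queue behind pid 19244 obvious:

    # Who holds and who waits for locks on the contested relation.
    psql -U postgres -d social -c "
        SELECT pid, mode, granted
        FROM pg_locks
        WHERE relation = 6586066
        ORDER BY granted DESC, pid;"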
James Robinson <jlrobins@socialserve.com> writes:
> As fate would have it, the vacuumdb frontend and backend which were
> initially afflicted are still in existence:

OK, so pid 19244 isn't blocked on any lmgr lock; else we'd see an entry
with granted = f for it in pg_locks.  It could be blocked on a
lower-level lock though.  Can you attach to that PID with gdb and get a
stack trace?  Something like (as postgres user)

    gdb /path/to/postgres-executable 19244
    gdb> bt
    gdb> quit
    ok to detach? yes

			regards, tom lane
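For reference, the same attach-and-backtrace can be captured non-interactively with gdb's batch mode, which is convenient when the result is going straight into a mail. A sketch, using the PID and binary path already mentioned in this thread:

    # Dump a backtrace of the stuck backend without an interactive session.
    echo bt > /tmp/gdb-cmds
    gdb -batch -x /tmp/gdb-cmds /usr/local/pgsql/bin/postgres 19244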
Here ya go -- BTW, you guys' support is the _best_. But you know that already:

    postgres@db01:/home/sscadmin> gdb /usr/local/pgsql/bin/postgres 19244
    GNU gdb 6.2.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i586-suse-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

    Attaching to program: /usr/local/pgsql/bin/postgres, process 19244
    Reading symbols from /lib/libz.so.1...done.
    Loaded symbols for /lib/libz.so.1
    Reading symbols from /lib/libreadline.so.5...done.
    Loaded symbols for /lib/libreadline.so.5
    Reading symbols from /lib/libcrypt.so.1...done.
    Loaded symbols for /lib/libcrypt.so.1
    Reading symbols from /lib/libresolv.so.2...done.
    Loaded symbols for /lib/libresolv.so.2
    Reading symbols from /lib/libnsl.so.1...done.
    Loaded symbols for /lib/libnsl.so.1
    Reading symbols from /lib/libdl.so.2...done.
    Loaded symbols for /lib/libdl.so.2
    Reading symbols from /lib/tls/libm.so.6...done.
    Loaded symbols for /lib/tls/libm.so.6
    Reading symbols from /lib/tls/libc.so.6...done.
    Loaded symbols for /lib/tls/libc.so.6
    Reading symbols from /lib/libncurses.so.5...done.
    Loaded symbols for /lib/libncurses.so.5
    Reading symbols from /lib/ld-linux.so.2...done.
    Loaded symbols for /lib/ld-linux.so.2
    Reading symbols from /lib/libnss_compat.so.2...done.
    Loaded symbols for /lib/libnss_compat.so.2
    Reading symbols from /lib/libnss_nis.so.2...done.
    Loaded symbols for /lib/libnss_nis.so.2
    Reading symbols from /lib/libnss_files.so.2...done.
    Loaded symbols for /lib/libnss_files.so.2
    0xffffe410 in ?? ()
    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffd508 in ?? ()
    #2  0x082aef97 in PqSendBuffer ()
    #3  0xbfffd4f0 in ?? ()
    #4  0xb7ec03e1 in send () from /lib/tls/libc.so.6
    #5  0x08137d27 in secure_write ()
    #6  0x0813c2a7 in internal_flush ()
    #7  0x0813c4ff in pq_flush ()
    #8  0x0820bfec in EmitErrorReport ()
    #9  0x0820b5ac in errfinish ()
    #10 0x0811d0a8 in lazy_vacuum_rel ()
    #11 0x0811ac5a in vacuum_rel ()
    #12 0x0811bb93 in vacuum ()
    #13 0x0819c84d in PortalRunUtility ()
    #14 0x0819d9b8 in PortalRun ()
    #15 0x0819b221 in PostgresMain ()
    #16 0x0816ffa9 in ServerLoop ()
    #17 0x08170de9 in PostmasterMain ()
    #18 0x0813e5e5 in main ()
    (gdb) quit

----
James Robinson
Socialserve.com
James Robinson <jlrobins@socialserve.com> writes:
> (gdb) bt
> #0  0xffffe410 in ?? ()
> #1  0xbfffd508 in ?? ()
> #2  0x082aef97 in PqSendBuffer ()
> #3  0xbfffd4f0 in ?? ()
> #4  0xb7ec03e1 in send () from /lib/tls/libc.so.6
> #5  0x08137d27 in secure_write ()
> #6  0x0813c2a7 in internal_flush ()
> #7  0x0813c4ff in pq_flush ()
> #8  0x0820bfec in EmitErrorReport ()
> #9  0x0820b5ac in errfinish ()
> #10 0x0811d0a8 in lazy_vacuum_rel ()
> #11 0x0811ac5a in vacuum_rel ()
> #12 0x0811bb93 in vacuum ()
> #13 0x0819c84d in PortalRunUtility ()
> #14 0x0819d9b8 in PortalRun ()
> #15 0x0819b221 in PostgresMain ()
> #16 0x0816ffa9 in ServerLoop ()
> #17 0x08170de9 in PostmasterMain ()
> #18 0x0813e5e5 in main ()
> (gdb) quit

Hmm, what this says is that the backend is blocked trying to send an
error or notice message to the client.  We can't tell anything about
what the message was, but that's not so relevant ... the focus now has
to shift to the network or client side, ie, why in the world is the
backend waiting on the client to accept a message?

Can you get a similar backtrace from the vacuumdb process?  (Obviously,
give gdb the vacuumdb executable not the postgres one.)

			regards, tom lane
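A quick shell-level cross-check of the "blocked in send()" reading (a sketch; 19244 is the backend PID from the trace): the process state should show 'S' (sleeping) rather than 'R', with a socket-related kernel wait channel.

    # Show state and kernel wait channel for the stuck backend.
    ps -o pid,stat,wchan,args -p 19244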
On Nov 28, 2005, at 11:38 AM, Tom Lane wrote:
> Can you get a similar backtrace from the vacuumdb process?  (Obviously,
> give gdb the vacuumdb executable not the postgres one.)

OK:

    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffe4f8 in ?? ()
    #2  0x00000030 in ?? ()
    #3  0x08057b68 in ?? ()
    #4  0xb7e98533 in __write_nocancel () from /lib/tls/libc.so.6
    #5  0xb7e4aae6 in _IO_new_file_write () from /lib/tls/libc.so.6
    #6  0xb7e4a7e5 in new_do_write () from /lib/tls/libc.so.6
    #7  0xb7e4aa63 in _IO_new_file_xsputn () from /lib/tls/libc.so.6
    #8  0xb7e413a2 in fputs () from /lib/tls/libc.so.6
    #9  0xb7fd8f99 in defaultNoticeProcessor () from /usr/local/pgsql/lib/libpq.so.4
    #10 0xb7fd8fe5 in defaultNoticeReceiver () from /usr/local/pgsql/lib/libpq.so.4
    #11 0xb7fe2d34 in pqGetErrorNotice3 () from /usr/local/pgsql/lib/libpq.so.4
    #12 0xb7fe3921 in pqParseInput3 () from /usr/local/pgsql/lib/libpq.so.4
    #13 0xb7fdb174 in parseInput () from /usr/local/pgsql/lib/libpq.so.4
    #14 0xb7fdca99 in PQgetResult () from /usr/local/pgsql/lib/libpq.so.4
    #15 0xb7fdcc4b in PQexecFinish () from /usr/local/pgsql/lib/libpq.so.4
    #16 0x0804942c in vacuum_one_database ()
    #17 0x080497a1 in main ()

Things to know which could possibly be of use: this cron is kicked off on the backup database box, and the vacuumdb is run via ssh to the primary box. The primary box runs the vacuumdb operation with --analyze --verbose, with the output being streamed to a logfile on the backup box.

Lemme guess: __write_nocancel calls the write syscall, and 0x00000030 could well be the syscall entry point? Something gumming up the networking or sshd itself could have stopped up the output queues, and the back-pressure propagated all the way down to this level? If so, only dummies back up / vacuum direct to a remote host?

----
James Robinson
Socialserve.com
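One way to see exactly where the vacuumdb frontend's blocked output is headed (a sketch; 19236 is the frontend PID from the earlier ps listing): its stdout/stderr should resolve to the pty or pipe owned by the ssh session.

    # Inspect the open file descriptors of the stuck vacuumdb frontend.
    ls -l /proc/19236/fd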
James Robinson <jlrobins@socialserve.com> writes:
> On Nov 28, 2005, at 11:38 AM, Tom Lane wrote:
>> Can you get a similar backtrace from the vacuumdb process?

> OK:

> (gdb) bt
> #0  0xffffe410 in ?? ()
> #1  0xbfffe4f8 in ?? ()
> #2  0x00000030 in ?? ()
> #3  0x08057b68 in ?? ()
> #4  0xb7e98533 in __write_nocancel () from /lib/tls/libc.so.6
> #5  0xb7e4aae6 in _IO_new_file_write () from /lib/tls/libc.so.6
> #6  0xb7e4a7e5 in new_do_write () from /lib/tls/libc.so.6
> #7  0xb7e4aa63 in _IO_new_file_xsputn () from /lib/tls/libc.so.6
> #8  0xb7e413a2 in fputs () from /lib/tls/libc.so.6
> #9  0xb7fd8f99 in defaultNoticeProcessor () from /usr/local/pgsql/lib/libpq.so.4
> #10 0xb7fd8fe5 in defaultNoticeReceiver () from /usr/local/pgsql/lib/libpq.so.4
> #11 0xb7fe2d34 in pqGetErrorNotice3 () from /usr/local/pgsql/lib/libpq.so.4
> #12 0xb7fe3921 in pqParseInput3 () from /usr/local/pgsql/lib/libpq.so.4
> #13 0xb7fdb174 in parseInput () from /usr/local/pgsql/lib/libpq.so.4
> #14 0xb7fdca99 in PQgetResult () from /usr/local/pgsql/lib/libpq.so.4
> #15 0xb7fdcc4b in PQexecFinish () from /usr/local/pgsql/lib/libpq.so.4
> #16 0x0804942c in vacuum_one_database ()
> #17 0x080497a1 in main ()

OK, so evidently the backend is sending NOTICE messages, and the
vacuumdb is blocked trying to copy those messages to stderr.

> Things to know which could possibly be of use: this cron is kicked off
> on the backup database box, and the vacuumdb is run via ssh to the
> primary box. The primary box runs the vacuumdb operation with --analyze
> --verbose, with the output being streamed to a logfile on the backup
> box. Lemme guess: __write_nocancel calls the write syscall, and
> 0x00000030 could well be the syscall entry point? Something gumming up
> the networking or sshd itself could have stopped up the output queues,
> and the back-pressure propagated all the way down to this level?

That's what it looks like: the output queue from the vacuumdb has
stopped up somehow.  Your next move is to look at the state of sshd
and whatever is running at the client end of the ssh tunnel.

			regards, tom lane
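One way to keep a wedged ssh channel from ever back-pressuring the backend (an untested sketch; $SSHCMD, $DBHOST and $DATE are the variable names from the cron script quoted later in the thread, and /tmp/vacuum.log is a hypothetical scratch path): write the verbose log on the database host itself and copy it back afterwards, so vacuumdb's stderr never depends on the ssh pipe.

    # Run the vacuum remotely but log locally on the database host,
    # then pull the finished log back to the backup box.
    $SSHCMD $DBHOST "/usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose > /tmp/vacuum.log 2>&1"
    scp $DBHOST:/tmp/vacuum.log $DATE/vacuum.log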
On Nov 28, 2005, at 12:00 PM, Tom Lane wrote:
> Your next move is to look at the state of sshd
> and whatever is running at the client end of the ssh tunnel.

backtrace of the sshd doesn't look good:

    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffdb48 in ?? ()
    #2  0x080a1e28 in ?? ()
    #3  0x080a1e78 in ?? ()
    #4  0xb7d379fd in ___newselect_nocancel () from /lib/tls/libc.so.6
    #5  0x08054d64 in ?? ()
    #6  0x0000000a in ?? ()
    #7  0x080a1e78 in ?? ()
    #8  0x080a1e28 in ?? ()
    #9  0x00000000 in ?? ()
    #10 0xbfffdb30 in ?? ()
    #11 0x00000000 in ?? ()
    #12 0xbfffdb48 in ?? ()
    #13 0x0806c796 in ?? ()
    #14 0x080a9d3c in ?? ()
    #15 0x00000001 in ?? ()
    #16 0xbfffdb64 in ?? ()
    #17 0x08054c3d in ?? ()
    #18 0x00000019 in ?? ()
    #19 0x000acda0 in ?? ()
    #20 0x080a9d3c in ?? ()
    #21 0x00000000 in ?? ()
    #22 0xbfffdb6c in ?? ()
    #23 0x00000000 in ?? ()
    #24 0xbfffdb78 in ?? ()
    ---Type <return> to continue, or q <return> to quit---
    #25 0x08055632 in ?? ()
    #26 0xbfffdb6c in ?? ()
    #27 0x00000000 in ?? ()
    #28 0x080a1e78 in ?? ()
    #29 0x08098ee8 in ?? ()
    #30 0x080a1e78 in ?? ()
    #31 0x080a1e28 in ?? ()
    #32 0x00000009 in ?? ()
    #33 0x00000004 in ?? ()
    #34 0x00000001 in ?? ()
    #35 0x00000001 in ?? ()
    #36 0xbfffdbb8 in ?? ()
    #37 0x0805b816 in ?? ()
    #38 0x08098ee8 in ?? ()
    #39 0x080a2e10 in ?? ()
    #40 0x00000007 in ?? ()
    #41 0x08098ee8 in ?? ()
    #42 0x08080fd2 in _IO_stdin_used ()
    #43 0x08098ee8 in ?? ()
    #44 0xbfffdbb8 in ?? ()
    #45 0x080574a3 in ?? ()
    #46 0x00000000 in ?? ()
    #47 0x08098ee8 in ?? ()
    #48 0x08098ee8 in ?? ()
    #49 0x08098f30 in ?? ()
    ---Type <return> to continue, or q <return> to quit---
    #50 0x08080fd2 in _IO_stdin_used ()
    #51 0x08098ee8 in ?? ()
    #52 0xbfffeb98 in ?? ()
    #53 0x0804fc90 in ?? ()
    #54 0x08098ee8 in ?? ()
    #55 0x08098f74 in ?? ()
    #56 0x08098f30 in ?? ()
    #57 0xbfffe110 in ?? ()
    #58 0xbfffe110 in ?? ()
    #59 0x0808014a in _IO_stdin_used ()
    #60 0xb7ffad95 in malloc () from /lib/ld-linux.so.2
    Previous frame inner to this frame (corrupt stack?)

The client-side ssh is worse -- 507 frames before it reports '(corrupt stack?)'.

At this moment in time, should we kill off the offending processes from Nov 25, starting from the client-most side all the way to the vacuumdb process on the production server? The other vacuums would probably then complete happily, and we'd be cool again, eh?

I suppose we're darn lucky the process got ultimately gummed up on a table that sees no traffic at all, eh? The lock that vacuum has taken out on it would prevent at least some things happening to the table in question -- possibly even new inserts or updates?

Could this be alleviated in the future by a little code reordering in vacuumdb or the postmaster -- finishing work on the current table completely before emitting output (either postmaster -> vacuumdb client, or vacuumdb client -> wherever stdout is directed), so that any gumming-up happens in a state where no locks are being held? Or would that uglify the code too much, and/or would people find the additional buffering a damnable offense?

----
James Robinson
Socialserve.com
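Another angle on locating the blockage (a sketch; it assumes 19235, the vacuumdb frontend's parent in the earlier ps listing, is the per-session sshd on the primary box): sshd's link back to the backup box is ordinary TCP, so a Send-Q that never drains for that connection would put the blockage at or beyond the network, while an empty Send-Q would point instead at ssh's own channel flow control or the far-end client.

    # On the primary box: look at the TCP send queue for the sshd session
    # (needs root to see the PID/Program name column).
    netstat -tnp | grep 19235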
James Robinson <jlrobins@socialserve.com> writes:
> backtrace of the sshd doesn't look good:

Stripped executable :-( ... you won't get much info there.  What of
the client at the far end of the ssh connection?  You should probably
assume that the blockage is there, rather than in a commonly used bit
of software like ssh.

> At this moment in time, should we kill off the offending processes
> from Nov 25, starting from the client-most side all the way to the
> vacuumdb process on the production server? The other vacuums would
> probably then complete happily, and we'd be cool again, eh?

If you just want to get out of it, killing the vacuumdb should be the
least dangerous way to get out of the problem.  I'd suggest taking a
little more time to try to find out what's stuck though.

> I suppose we're darn lucky the process got ultimately gummed up on a
> table that sees no traffic at all, eh? The lock that vacuum has taken
> out on it would prevent at least some things happening to the table in
> question -- possibly even new inserts or updates?

No, since it's a lazy vacuum it's not going to block either reads or
writes.  Just other vacuums and DDL changes on the table.

> Could this be alleviated in the future by a little code reordering in
> vacuumdb or the postmaster -- finishing work on the current table
> completely before emitting output,

Wouldn't help.  I would imagine that by the time we got to this state,
the backend side is quite a few tables past the point where the end
client stopped accepting output.  You'd normally expect there to be
several Kb worth of buffering in the network channel, and the backend
isn't going to be blocked till that's *all* used up.

BTW, I suppose this was a VACUUM VERBOSE?  Without the verbose output,
you'd likely not have seen any problem ...

			regards, tom lane
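Tom's point about lock levels can be sanity-checked directly while the stuck VACUUM still holds its ShareUpdateExclusiveLock (a sketch, not something run in the thread): plain reads go straight through, while a second vacuum of the same table queues up behind the stuck one.

    # A SELECT only needs AccessShareLock, so it is not blocked:
    psql -U postgres -d social -c "SELECT count(*) FROM agency.swlog_client;"
    # Another VACUUM wants ShareUpdateExclusiveLock, which conflicts with
    # itself, so this one waits until the stuck backend goes away:
    psql -U postgres -d social -c "VACUUM agency.swlog_client;"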
James Robinson <jlrobins@socialserve.com> writes:
> Given the other culprits in play are bash running a straightforward
> shellscript line with redirected output to a simple file on a non-full
> filesystem, I'm leaning more towards the odds that something related
> to the sshd + tcp/ip + ssh client portion of things went crazy.

Yeah, could be.  Anyway it doesn't seem like we can learn much more
today.  You might as well just zing the vacuumdb process and let
things get back to normal.  If it happens again, we'd have reason
to dig deeper.

			regards, tom lane
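The "zing the vacuumdb process" step as a sketch: killing the client end is enough, since the backend's blocked send() then fails and the backend aborts its VACUUM on its own. The PID is the vacuumdb frontend from the earlier ps listing; the stuck bash/ssh pair on the backup box can be cleaned up the same way once their PIDs are known.

    # On the primary box: kill the stuck vacuumdb frontend.
    kill 19236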
On Nov 28, 2005, at 1:46 PM, Tom Lane wrote:
> James Robinson <jlrobins@socialserve.com> writes:
>> backtrace of the sshd doesn't look good:
>
> Stripped executable :-( ... you won't get much info there.  What of
> the client at the far end of the ssh connection?  You should probably
> assume that the blockage is there, rather than in a commonly used bit
> of software like ssh.

OK: cron fired off a bash running our script which performs the backup + vacuuming on the backup box side, and that script was at the point of driving vacuumdb:

    sscadmin 20612 20610  0 Nov25 ?  00:00:00 bash /usr/local/bin/db_backup.sh
    sscadmin 20622 20612  0 Nov25 ?  00:00:00 ssh -c blowfish [ ssh identity file + host edited out ] /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose

[ yes, verbose vacuum. Who knew that'd be the camel-breaking straw??! ]

The line in the script invoking the ssh'd vacuumdb is:

    # Vacuum all databases, storing log results.
    $SSHCMD $DBHOST /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose >& $DATE/vacuum.log

Unfortunately the dir holding that date + hour's vacuum.log was swept away by the next day's activities.

The stuck bash is backtraceable:

    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbffff928 in ?? ()
    #2  0x00000000 in ?? ()
    #3  0xbffff918 in ?? ()
    #4  0xb7ed1513 in __waitpid_nocancel () from /lib/tls/libc.so.6
    #5  0x080935bf in default_tty_job_signals ()
    #6  0x080949ca in wait_for ()
    #7  0x0808acd7 in execute_command_internal ()
    #8  0x0808a4f0 in execute_command ()
    #9  0x0808241d in reader_loop ()
    #10 0x08081364 in main ()
    (gdb)

Nothing unexpected there. Funny though, file reports /bin/bash as being stripped (as it does for /usr/bin/ssh and /usr/sbin/sshd), yet I could get far better debugging support from it. Could the debugging issue be ssh / sshd's apparent multithreadedness?

    gdb /usr/bin/ssh
    GNU gdb 6.2.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i586-suse-linux"...(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".
    (gdb) run localhost
    Starting program: /usr/bin/ssh localhost
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    [Thread debugging using libthread_db enabled]
    [New Thread 1078408704 (LWP 29932)]
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    The authenticity of host 'localhost (127.0.0.1)' can't be established.
    RSA key fingerprint is f4:cd:bc:37:d7:08:bc:4f:04:91:45:9b:44:cf:d5:b9.
    Are you sure you want to continue connecting (yes/no)?

    Program received signal SIGINT, Interrupt.
    [Switching to Thread 1078408704 (LWP 29932)]
    0xffffe410 in ?? ()
    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffb9e8 in ?? ()
    #2  0x00000001 in ?? ()
    #3  0xbfffb3af in ?? ()
    #4  0x402f94b3 in __read_nocancel () from /lib/tls/libc.so.6
    #5  0x080742e1 in mkstemp64 ()
    #6  0x080684c8 in error ()
    #7  0x08054e91 in ?? ()
    #8  0xbfffcac0 in ?? ()
    #9  0x00000001 in ?? ()
    #10 0x08096230 in ?? ()
    #11 0xbfffcac0 in ?? ()
    ....

I know you have far better things to do than teach someone how to use gdb on multithreaded programs, but could a proper backtrace be salvageable on the ssh client? If you really care, that is; otherwise I'm off to kill that vacuumdb client.

>> At this moment in time, should we kill off the offending processes
>> from Nov 25, starting from the client-most side all the way to the
>> vacuumdb process on the production server? The other vacuums would
>> probably then complete happily, and we'd be cool again, eh?
>
> If you just want to get out of it, killing the vacuumdb should be the
> least dangerous way to get out of the problem.  I'd suggest taking a
> little more time to try to find out what's stuck though.

Given the other culprits in play are bash running a straightforward shellscript line with redirected output to a simple file on a non-full filesystem, I'm leaning more towards the odds that something related to the sshd + tcp/ip + ssh client portion of things went crazy. It just seems that's where more of the complexity is -- bash is doing nothing but waiting for the ssh client to finish, and its stdout / stderr were redirected to the filesystem before the bash child exec'd ssh. Even if the filesystem became full somehow during that run, one would expect the ssh client's writes to the file to error out as opposed to blocking. Funky.

>> I suppose we're darn lucky the process got ultimately gummed up on a
>> table that sees no traffic at all, eh? The lock that vacuum has taken
>> out on it would prevent at least some things happening to the table in
>> question -- possibly even new inserts or updates?
>
> No, since it's a lazy vacuum it's not going to block either reads or
> writes.  Just other vacuums and DDL changes on the table.

Good to know.

>> Could this be alleviated in the future by a little code reordering in
>> vacuumdb or the postmaster -- finishing work on the current table
>> completely before emitting output,
>
> Wouldn't help.  I would imagine that by the time we got to this state,
> the backend side is quite a few tables past the point where the end
> client stopped accepting output.  You'd normally expect there to be
> several Kb worth of buffering in the network channel, and the backend
> isn't going to be blocked till that's *all* used up.

But when it does get blocked up, albeit after the fact, it wouldn't be holding any postgres-level locks. Then again, since this was a lazy vacuum, only other lazy vacuums or DDL would be blocked; had some other poor chap been running a full vacuum, it could have been worse.

> BTW, I suppose this was a VACUUM VERBOSE?  Without the verbose output,
> you'd likely not have seen any problem ...

Yep -- see above. I do believe we've agreed internally to separate out the vacuuming bit from the backup bit, and have it driven by a separate cron'd script on the primary database box itself, to simplify the system for this case. And with 8.1 we'll move to the autovacuum service, so as to obliterate the need altogether. So our vacuuming would happen in a globally far simpler system -- no remote-host networking involved at all.

Again, thanks for all of your time and superb support.

----
James Robinson
Socialserve.com
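The "separate, local" arrangement described above, as a minimal sketch (the log directory and filenames are hypothetical): the vacuum is driven by cron on the database host itself and logs locally, so no ssh channel ever sits between the backend and its output.

    #!/bin/sh
    # Hourly, cron-driven on the primary database box; hypothetical paths.
    DATE=`date +%Y%m%d%H`
    LOGDIR=/var/log/pg_maintenance/$DATE
    mkdir -p $LOGDIR
    /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose \
        > $LOGDIR/vacuum.log 2>&1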
On Nov 28, 2005, at 4:13 PM, Tom Lane wrote:
> Yeah, could be.  Anyway it doesn't seem like we can learn much more
> today.  You might as well just zing the vacuumdb process and let
> things get back to normal.  If it happens again, we'd have reason
> to dig deeper.

Final report [ and apologies to the hackers list in general -- sorry for the noise today ].

Killed the vacuumdb frontend, then went off killing the processes spawned by cron on Nov 25 related to the cronjob. All of the related backends exited peacefully, and all is well. A manual vacuum verbose analyze completes successfully.

One possibly curious thing -- one final process remains on the backup box dated Nov 25:

    root 19912 3 0 Nov25 ?  00:00:12 [pdflush]

Coincidence? This is some sort of kernel thread, right? Flushes dirty pages to disk? There are two on this machine:

    root  9211 3 0 Nov22 ?  00:02:56 [pdflush]
    root 19912 3 0 Nov25 ?  00:00:12 [pdflush]

The Nov 25th pdflush's pid is suspiciously close to the pids which would have been in use around the beginning of the cron'd process. [ checks /var/log/messages ... ] -- yep, real close -- the last known cross-referencable pid is:

    Nov 25 04:59:01 db02 /usr/sbin/cron[20590]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly)

and the vacuumdb sshd connection on the production db box is logged at 05:02:22 AM, so that pdflush would have been started real close to the time at which the remote backup + vacuum script was running.

Any Linux 2.6 gurus lurking? Under what circumstances do pdflushes get spawned? The filesystem the outputs were going to is a software RAID partition (raid-0? raid-1? I always confuse the two -- the interleaved one anyway, not mirrored), formatted reiser3. Neither pdflush instance on this machine was started anywhere near the boot time of the machine -- both much later. Whereas on the production box the two pdflush instances are both dated from machine boot time. Does this perchance indicate unhappiness afoot, perhaps hardware-wise?

----
James Robinson
Socialserve.com