Thread: Help: 8.0.3 Vacuum of an empty table never completes ...
G'day folks. We have a production database running 8.0.3 which gets fully pg_dump'd and vacuum analyze'd hourly by cron. Something strange happened to us on the 5 AM Friday Nov. 25th cron run -- this step in our cron procedure never completed:

    /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose >& $DATE/vacuum.log

Strange, since no known event of note happened on Friday; we were all out of the office after Wednesday for the American Thanksgiving holiday.

Anyway, running the vacuum line by hand shows it getting stuck -- it processes the majority of our tables, then just stops, and the backend postmaster stops accumulating CPU time. Comparing the logs against a run that did complete, one table in particular (at least) seems afflicted:

    social=# vacuum verbose analyze agency.swlog_client;

hangs up forever -- we have to control-C the client. Likewise without 'analyze'. pg_dump'ing the entire database works (phew!), and after restoring onto a backup box, said table can be vacuumed:

    social=# vacuum verbose analyze agency.swlog_client;
    INFO:  vacuuming "agency.swlog_client"
    INFO:  index "swlog_client_pkey" now contains 0 row versions in 1 pages
    DETAIL:  0 index pages have been deleted, 0 are currently reusable.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
    INFO:  "swlog_client": found 0 removable, 0 nonremovable row versions in 0 pages
    DETAIL:  0 dead row versions cannot be removed yet.
    There were 0 unused item pointers.
    0 pages are entirely empty.
    CPU 0.00s/0.00u sec elapsed 0.00 sec.
    INFO:  analyzing "agency.swlog_client"
    INFO:  "swlog_client": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
    VACUUM

That's right -- a completely empty table, which is what we actually expect. How should we proceed so that we can learn from this and also get our entire database vacuuming successfully again?

We are running on Linux 2.6.8-24.18-smp (SuSE 9.2). No juicy filesystem-related messages in dmesg nor /var/log/messages. 11% disk used on the postgres-related partition.

The table in question is defined as:

    social=# \d agency.swlog_client;
         Table "agency.swlog_client"
     Column |  Type  | Modifiers
    --------+--------+-----------
     swlog  | bigint | not null
     client | bigint | not null
    Indexes:
        "swlog_client_pkey" PRIMARY KEY, btree (swlog, client)
    Foreign-key constraints:
        "$2" FOREIGN KEY (client) REFERENCES agency.client(id)
        "$1" FOREIGN KEY (swlog) REFERENCES agency.swlog(id)

And the two fk'd tables:

    social=# select count(*) from agency.client;
     count
    -------
         0
    (1 row)

    social=# select count(*) from agency.swlog;
     count
    -------
        69
    (1 row)

We doubt there is any oddball, extremely long-running transaction in any of those related tables gumming up this table.

Finally, the only potentially interesting database-wise event happened on Wed. Nov. 23rd -- we SIGHUP'd the postmaster to have it learn a higher value for work_mem (10240, up from the default of 1024). But the hourly crons went fine for the subsequent two days. maintenance_work_mem is still at the default of 16384.

Many thanks in advance!
James

----
James Robinson
Socialserve.com
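For the "oddball long-running transaction" worry above, one quick check is pg_stat_activity. A minimal sketch, assuming the 8.0-era column names (procpid, query_start, current_query) and that stats_command_string is enabled so current_query is populated:

    # List backends and what they are currently running, oldest query first.
    psql -U postgres -d social -c "
        SELECT procpid, usename, query_start, current_query
        FROM pg_stat_activity
        ORDER BY query_start;"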
James Robinson <jlrobins@socialserve.com> writes:
> Comparing the logs against a run that did complete, one table in
> particular (at least) seems afflicted:

> social=# vacuum verbose analyze agency.swlog_client;

> hangs up forever -- we have to control-C the client. Likewise without
> 'analyze'.

Given that it's not eating CPU time, one would guess that it's blocked
waiting for a lock.  Can you find any relevant locks in pg_locks?

			regards, tom lane
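A minimal sketch of the sort of pg_locks check being suggested here (not the exact query used later in the thread), joined to pg_class so waiting relations come back by name rather than OID:

    # Show ungranted lock requests and the relations they are waiting on.
    psql -U postgres -d social -c "
        SELECT l.pid, c.relname, l.mode, l.granted
        FROM pg_locks l
        LEFT JOIN pg_class c ON c.oid = l.relation
        WHERE NOT l.granted;"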
As fate would have it, the vacuumdb frontend and backend which were initially afflicted are still in existence:

    sscadmin 19236 19235  0 Nov25 ?  00:00:00 /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose
    postgres 19244  3596  0 Nov25 ?  00:00:02 postgres: postgres social [local] VACUUM

The backend is pid 19244. And here's pg_locks:

    social=# select * from pg_locks;
     relation | database | transaction |  pid  |           mode           | granted
    ----------+----------+-------------+-------+--------------------------+---------
              |          |    38790657 | 19244 | ExclusiveLock            | t
      6586066 |  6585892 |             | 28406 | ShareUpdateExclusiveLock | f
              |          |    39097312 | 28861 | ExclusiveLock            | t
              |          |    39089744 | 28756 | ExclusiveLock            | t
      6586066 |  6585892 |             | 28756 | ShareUpdateExclusiveLock | f
      6586066 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
      6586066 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
      8417138 |  6585892 |             | 19244 | ShareUpdateExclusiveLock | t
        16839 |  6585892 |             | 28861 | AccessShareLock          | t
              |          |    39063661 | 28560 | ExclusiveLock            | t
              |          |    39056736 | 28406 | ExclusiveLock            | t
      6586066 |  6585892 |             | 28560 | ShareUpdateExclusiveLock | f
    (12 rows)

      pid
    -------
     19244
     28406
     28560
     28756
     28861
    (5 rows)

Of those 5 pids:

    19244 -- the vacuum backend initially afflicted. Status in argv: 'postgres: postgres social [local] VACUUM'
    28406 -- a 10 AM vacuum started up by cron this morning, after I got things halfway working again early in the diagnosis of this situation. Args: 'postgres: postgres social [local] VACUUM waiting'
    28560 -- a 10:16 AM by-hand vacuum session, further into diagnosis land. Args: 'postgres: postgres social [local] VACUUM waiting'
    28756 -- the 11 AM cron'd process. Yes, I see a quickly mounting issue here. Args: 'postgres: postgres social [local] VACUUM waiting'
    28861 -- a production-servicing backend, now back in the idle state [ not "idle in transaction", just regular idle ].

On Nov 28, 2005, at 11:09 AM, Tom Lane wrote:
> James Robinson <jlrobins@socialserve.com> writes:
>> Comparing the logs against a run that did complete, one table in
>> particular (at least) seems afflicted:

>> social=# vacuum verbose analyze agency.swlog_client;

>> hangs up forever -- we have to control-C the client. Likewise without
>> 'analyze'.
>
> Given that it's not eating CPU time, one would guess that it's blocked
> waiting for a lock.  Can you find any relevant locks in pg_locks?
>
> 			regards, tom lane

----
James Robinson
Socialserve.com
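A possible follow-up on the output above (a sketch, not from the original mail): relation 6586066 is the one collecting the ungranted ShareUpdateExclusiveLock requests, so listing holders versus waiters on just that OID makes the queue behind pid 19244 obvious:

    # Who holds and who waits for locks on the contested relation.
    psql -U postgres -d social -c "
        SELECT pid, mode, granted
        FROM pg_locks
        WHERE relation = 6586066
        ORDER BY granted DESC, pid;"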
James Robinson <jlrobins@socialserve.com> writes:
> As fate would have it, the vacuumdb frontend and backend which were
> initially afflicted are still in existence:

OK, so pid 19244 isn't blocked on any lmgr lock; else we'd see an entry
with granted = f for it in pg_locks.  It could be blocked on a
lower-level lock though.  Can you attach to that PID with gdb and get a
stack trace?  Something like (as postgres user)

    gdb /path/to/postgres-executable 19244
    gdb> bt
    gdb> quit
    ok to detach? yes

			regards, tom lane
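For reference, the same attach-and-backtrace can be captured non-interactively with gdb's batch mode, which is convenient when the result is going straight into a mail. A sketch, using the PID and binary path already mentioned in this thread:

    # Dump a backtrace of the stuck backend without an interactive session.
    echo bt > /tmp/gdb-cmds
    gdb -batch -x /tmp/gdb-cmds /usr/local/pgsql/bin/postgres 19244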
Here ya go -- BTW, you guys' support is the _best_. But you know that already:

    postgres@db01:/home/sscadmin> gdb /usr/local/pgsql/bin/postgres 19244
    GNU gdb 6.2.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i586-suse-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

    Attaching to program: /usr/local/pgsql/bin/postgres, process 19244
    Reading symbols from /lib/libz.so.1...done.
    Loaded symbols for /lib/libz.so.1
    Reading symbols from /lib/libreadline.so.5...done.
    Loaded symbols for /lib/libreadline.so.5
    Reading symbols from /lib/libcrypt.so.1...done.
    Loaded symbols for /lib/libcrypt.so.1
    Reading symbols from /lib/libresolv.so.2...done.
    Loaded symbols for /lib/libresolv.so.2
    Reading symbols from /lib/libnsl.so.1...done.
    Loaded symbols for /lib/libnsl.so.1
    Reading symbols from /lib/libdl.so.2...done.
    Loaded symbols for /lib/libdl.so.2
    Reading symbols from /lib/tls/libm.so.6...done.
    Loaded symbols for /lib/tls/libm.so.6
    Reading symbols from /lib/tls/libc.so.6...done.
    Loaded symbols for /lib/tls/libc.so.6
    Reading symbols from /lib/libncurses.so.5...done.
    Loaded symbols for /lib/libncurses.so.5
    Reading symbols from /lib/ld-linux.so.2...done.
    Loaded symbols for /lib/ld-linux.so.2
    Reading symbols from /lib/libnss_compat.so.2...done.
    Loaded symbols for /lib/libnss_compat.so.2
    Reading symbols from /lib/libnss_nis.so.2...done.
    Loaded symbols for /lib/libnss_nis.so.2
    Reading symbols from /lib/libnss_files.so.2...done.
    Loaded symbols for /lib/libnss_files.so.2
    0xffffe410 in ?? ()
    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffd508 in ?? ()
    #2  0x082aef97 in PqSendBuffer ()
    #3  0xbfffd4f0 in ?? ()
    #4  0xb7ec03e1 in send () from /lib/tls/libc.so.6
    #5  0x08137d27 in secure_write ()
    #6  0x0813c2a7 in internal_flush ()
    #7  0x0813c4ff in pq_flush ()
    #8  0x0820bfec in EmitErrorReport ()
    #9  0x0820b5ac in errfinish ()
    #10 0x0811d0a8 in lazy_vacuum_rel ()
    #11 0x0811ac5a in vacuum_rel ()
    #12 0x0811bb93 in vacuum ()
    #13 0x0819c84d in PortalRunUtility ()
    #14 0x0819d9b8 in PortalRun ()
    #15 0x0819b221 in PostgresMain ()
    #16 0x0816ffa9 in ServerLoop ()
    #17 0x08170de9 in PostmasterMain ()
    #18 0x0813e5e5 in main ()
    (gdb) quit

----
James Robinson
Socialserve.com
James Robinson <jlrobins@socialserve.com> writes:
> (gdb) bt
> #0  0xffffe410 in ?? ()
> #1  0xbfffd508 in ?? ()
> #2  0x082aef97 in PqSendBuffer ()
> #3  0xbfffd4f0 in ?? ()
> #4  0xb7ec03e1 in send () from /lib/tls/libc.so.6
> #5  0x08137d27 in secure_write ()
> #6  0x0813c2a7 in internal_flush ()
> #7  0x0813c4ff in pq_flush ()
> #8  0x0820bfec in EmitErrorReport ()
> #9  0x0820b5ac in errfinish ()
> #10 0x0811d0a8 in lazy_vacuum_rel ()
> #11 0x0811ac5a in vacuum_rel ()
> #12 0x0811bb93 in vacuum ()
> #13 0x0819c84d in PortalRunUtility ()
> #14 0x0819d9b8 in PortalRun ()
> #15 0x0819b221 in PostgresMain ()
> #16 0x0816ffa9 in ServerLoop ()
> #17 0x08170de9 in PostmasterMain ()
> #18 0x0813e5e5 in main ()
> (gdb) quit

Hmm, what this says is that the backend is blocked trying to send an
error or notice message to the client.  We can't tell anything about
what the message was, but that's not so relevant ... the focus now has
to shift to the network or client side, ie, why in the world is the
backend waiting on the client to accept a message?

Can you get a similar backtrace from the vacuumdb process?  (Obviously,
give gdb the vacuumdb executable not the postgres one.)

			regards, tom lane
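A quick shell-level cross-check of the "blocked in send()" reading (a sketch; 19244 is the backend PID from the trace): the process state should show 'S' (sleeping) rather than 'R', with a socket-related kernel wait channel.

    # Show state and kernel wait channel for the stuck backend.
    ps -o pid,stat,wchan,args -p 19244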
On Nov 28, 2005, at 11:38 AM, Tom Lane wrote:
> Can you get a similar backtrace from the vacuumdb process?  (Obviously,
> give gdb the vacuumdb executable not the postgres one.)

OK:

    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffe4f8 in ?? ()
    #2  0x00000030 in ?? ()
    #3  0x08057b68 in ?? ()
    #4  0xb7e98533 in __write_nocancel () from /lib/tls/libc.so.6
    #5  0xb7e4aae6 in _IO_new_file_write () from /lib/tls/libc.so.6
    #6  0xb7e4a7e5 in new_do_write () from /lib/tls/libc.so.6
    #7  0xb7e4aa63 in _IO_new_file_xsputn () from /lib/tls/libc.so.6
    #8  0xb7e413a2 in fputs () from /lib/tls/libc.so.6
    #9  0xb7fd8f99 in defaultNoticeProcessor () from /usr/local/pgsql/lib/libpq.so.4
    #10 0xb7fd8fe5 in defaultNoticeReceiver () from /usr/local/pgsql/lib/libpq.so.4
    #11 0xb7fe2d34 in pqGetErrorNotice3 () from /usr/local/pgsql/lib/libpq.so.4
    #12 0xb7fe3921 in pqParseInput3 () from /usr/local/pgsql/lib/libpq.so.4
    #13 0xb7fdb174 in parseInput () from /usr/local/pgsql/lib/libpq.so.4
    #14 0xb7fdca99 in PQgetResult () from /usr/local/pgsql/lib/libpq.so.4
    #15 0xb7fdcc4b in PQexecFinish () from /usr/local/pgsql/lib/libpq.so.4
    #16 0x0804942c in vacuum_one_database ()
    #17 0x080497a1 in main ()

Things to know which could possibly be of use: this cron is kicked off on the backup database box, and the vacuumdb is run via ssh to the primary box. The primary box runs the vacuumdb operation with --analyze --verbose, with the output being streamed to a logfile on the backup box.

Lemme guess: __write_nocancel calls the write syscall, and 0x00000030 could well be the syscall entry point? Something gumming up the networking or sshd itself could have stopped up the output queues, and the back-pressure propagated all the way down to this level? If so, only dummies back up / vacuum direct to a remote host?

----
James Robinson
Socialserve.com
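One way to see exactly where the vacuumdb frontend's blocked output is headed (a sketch; 19236 is the frontend PID from the earlier ps listing): its stdout/stderr should resolve to the pty or pipe owned by the ssh session.

    # Inspect the open file descriptors of the stuck vacuumdb frontend.
    ls -l /proc/19236/fd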
James Robinson <jlrobins@socialserve.com> writes:
> On Nov 28, 2005, at 11:38 AM, Tom Lane wrote:
>> Can you get a similar backtrace from the vacuumdb process?

> OK:

> (gdb) bt
> #0  0xffffe410 in ?? ()
> #1  0xbfffe4f8 in ?? ()
> #2  0x00000030 in ?? ()
> #3  0x08057b68 in ?? ()
> #4  0xb7e98533 in __write_nocancel () from /lib/tls/libc.so.6
> #5  0xb7e4aae6 in _IO_new_file_write () from /lib/tls/libc.so.6
> #6  0xb7e4a7e5 in new_do_write () from /lib/tls/libc.so.6
> #7  0xb7e4aa63 in _IO_new_file_xsputn () from /lib/tls/libc.so.6
> #8  0xb7e413a2 in fputs () from /lib/tls/libc.so.6
> #9  0xb7fd8f99 in defaultNoticeProcessor () from /usr/local/pgsql/lib/libpq.so.4
> #10 0xb7fd8fe5 in defaultNoticeReceiver () from /usr/local/pgsql/lib/libpq.so.4
> #11 0xb7fe2d34 in pqGetErrorNotice3 () from /usr/local/pgsql/lib/libpq.so.4
> #12 0xb7fe3921 in pqParseInput3 () from /usr/local/pgsql/lib/libpq.so.4
> #13 0xb7fdb174 in parseInput () from /usr/local/pgsql/lib/libpq.so.4
> #14 0xb7fdca99 in PQgetResult () from /usr/local/pgsql/lib/libpq.so.4
> #15 0xb7fdcc4b in PQexecFinish () from /usr/local/pgsql/lib/libpq.so.4
> #16 0x0804942c in vacuum_one_database ()
> #17 0x080497a1 in main ()

OK, so evidently the backend is sending NOTICE messages, and the
vacuumdb is blocked trying to copy those messages to stderr.

> Things to know which could possibly be of use: this cron is kicked off
> on the backup database box, and the vacuumdb is run via ssh to the
> primary box. The primary box runs the vacuumdb operation with --analyze
> --verbose, with the output being streamed to a logfile on the backup
> box. Lemme guess: __write_nocancel calls the write syscall, and
> 0x00000030 could well be the syscall entry point? Something gumming up
> the networking or sshd itself could have stopped up the output queues,
> and the back-pressure propagated all the way down to this level?

That's what it looks like: the output queue from the vacuumdb has
stopped up somehow.  Your next move is to look at the state of sshd
and whatever is running at the client end of the ssh tunnel.

			regards, tom lane
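One way to keep a wedged ssh channel from ever back-pressuring the backend (an untested sketch; $SSHCMD, $DBHOST and $DATE are the variable names from the cron script quoted later in the thread, and /tmp/vacuum.log is a hypothetical scratch path): write the verbose log on the database host itself and copy it back afterwards, so vacuumdb's stderr never depends on the ssh pipe.

    # Run the vacuum remotely but log locally on the database host,
    # then pull the finished log back to the backup box.
    $SSHCMD $DBHOST "/usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose > /tmp/vacuum.log 2>&1"
    scp $DBHOST:/tmp/vacuum.log $DATE/vacuum.log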
On Nov 28, 2005, at 12:00 PM, Tom Lane wrote:
> Your next move is to look at the state of sshd
> and whatever is running at the client end of the ssh tunnel.

backtrace of the sshd doesn't look good:

    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffdb48 in ?? ()
    #2  0x080a1e28 in ?? ()
    #3  0x080a1e78 in ?? ()
    #4  0xb7d379fd in ___newselect_nocancel () from /lib/tls/libc.so.6
    #5  0x08054d64 in ?? ()
    #6  0x0000000a in ?? ()
    #7  0x080a1e78 in ?? ()
    #8  0x080a1e28 in ?? ()
    #9  0x00000000 in ?? ()
    #10 0xbfffdb30 in ?? ()
    #11 0x00000000 in ?? ()
    #12 0xbfffdb48 in ?? ()
    #13 0x0806c796 in ?? ()
    #14 0x080a9d3c in ?? ()
    #15 0x00000001 in ?? ()
    #16 0xbfffdb64 in ?? ()
    #17 0x08054c3d in ?? ()
    #18 0x00000019 in ?? ()
    #19 0x000acda0 in ?? ()
    #20 0x080a9d3c in ?? ()
    #21 0x00000000 in ?? ()
    #22 0xbfffdb6c in ?? ()
    #23 0x00000000 in ?? ()
    #24 0xbfffdb78 in ?? ()
    ---Type <return> to continue, or q <return> to quit---
    #25 0x08055632 in ?? ()
    #26 0xbfffdb6c in ?? ()
    #27 0x00000000 in ?? ()
    #28 0x080a1e78 in ?? ()
    #29 0x08098ee8 in ?? ()
    #30 0x080a1e78 in ?? ()
    #31 0x080a1e28 in ?? ()
    #32 0x00000009 in ?? ()
    #33 0x00000004 in ?? ()
    #34 0x00000001 in ?? ()
    #35 0x00000001 in ?? ()
    #36 0xbfffdbb8 in ?? ()
    #37 0x0805b816 in ?? ()
    #38 0x08098ee8 in ?? ()
    #39 0x080a2e10 in ?? ()
    #40 0x00000007 in ?? ()
    #41 0x08098ee8 in ?? ()
    #42 0x08080fd2 in _IO_stdin_used ()
    #43 0x08098ee8 in ?? ()
    #44 0xbfffdbb8 in ?? ()
    #45 0x080574a3 in ?? ()
    #46 0x00000000 in ?? ()
    #47 0x08098ee8 in ?? ()
    #48 0x08098ee8 in ?? ()
    #49 0x08098f30 in ?? ()
    ---Type <return> to continue, or q <return> to quit---
    #50 0x08080fd2 in _IO_stdin_used ()
    #51 0x08098ee8 in ?? ()
    #52 0xbfffeb98 in ?? ()
    #53 0x0804fc90 in ?? ()
    #54 0x08098ee8 in ?? ()
    #55 0x08098f74 in ?? ()
    #56 0x08098f30 in ?? ()
    #57 0xbfffe110 in ?? ()
    #58 0xbfffe110 in ?? ()
    #59 0x0808014a in _IO_stdin_used ()
    #60 0xb7ffad95 in malloc () from /lib/ld-linux.so.2
    Previous frame inner to this frame (corrupt stack?)

The client-side ssh is worse -- 507 frames before it reports '(corrupt stack?)'.

At this moment in time, should we kill off the offending processes from Nov 25, starting from the client-most side all the way to the vacuumdb process on the production server? The other vacuums would probably then complete happily, and we'd be cool again, eh?

I suppose we're darn lucky the process got ultimately gummed up on a table that sees no traffic at all, eh? The lock that vacuum has taken out on it would prevent at least some things happening to the table in question -- possibly even new inserts or updates?

Could this be alleviated in the future by a little code reordering in vacuumdb or the postmaster -- finishing work on the current table completely before emitting output (either postmaster -> vacuumdb client, or vacuumdb client -> wherever stdout is directed), so that any gumming-up happens in a state where no locks are being held? Or would that uglify the code too much, and/or would people find the additional buffering a damnable offense?

----
James Robinson
Socialserve.com
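Another angle on locating the blockage (a sketch; it assumes 19235, the vacuumdb frontend's parent in the earlier ps listing, is the per-session sshd on the primary box): sshd's link back to the backup box is ordinary TCP, so a Send-Q that never drains for that connection would put the blockage at or beyond the network, while an empty Send-Q would point instead at ssh's own channel flow control or the far-end client.

    # On the primary box: look at the TCP send queue for the sshd session
    # (needs root to see the PID/Program name column).
    netstat -tnp | grep 19235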
James Robinson <jlrobins@socialserve.com> writes:
> backtrace of the sshd doesn't look good:

Stripped executable :-( ... you won't get much info there.  What of
the client at the far end of the ssh connection?  You should probably
assume that the blockage is there, rather than in a commonly used bit
of software like ssh.

> At this moment in time, should we kill off the offending processes
> from Nov 25, starting from the client-most side all the way to the
> vacuumdb process on the production server? The other vacuums would
> probably then complete happily, and we'd be cool again, eh?

If you just want to get out of it, killing the vacuumdb should be the
least dangerous way to get out of the problem.  I'd suggest taking a
little more time to try to find out what's stuck though.

> I suppose we're darn lucky the process got ultimately gummed up on a
> table that sees no traffic at all, eh? The lock that vacuum has taken
> out on it would prevent at least some things happening to the table in
> question -- possibly even new inserts or updates?

No, since it's a lazy vacuum it's not going to block either reads or
writes.  Just other vacuums and DDL changes on the table.

> Could this be alleviated in the future by a little code reordering in
> vacuumdb or the postmaster -- finishing work on the current table
> completely before emitting output,

Wouldn't help.  I would imagine that by the time we got to this state,
the backend side is quite a few tables past the point where the end
client stopped accepting output.  You'd normally expect there to be
several Kb worth of buffering in the network channel, and the backend
isn't going to be blocked till that's *all* used up.

BTW, I suppose this was a VACUUM VERBOSE?  Without the verbose output,
you'd likely not have seen any problem ...

			regards, tom lane
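Tom's point about lock levels can be sanity-checked directly while the stuck VACUUM still holds its ShareUpdateExclusiveLock (a sketch, not something run in the thread): plain reads go straight through, while a second vacuum of the same table queues up behind the stuck one.

    # A SELECT only needs AccessShareLock, so it is not blocked:
    psql -U postgres -d social -c "SELECT count(*) FROM agency.swlog_client;"
    # Another VACUUM wants ShareUpdateExclusiveLock, which conflicts with
    # itself, so this one waits until the stuck backend goes away:
    psql -U postgres -d social -c "VACUUM agency.swlog_client;"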
James Robinson <jlrobins@socialserve.com> writes:
> Given the other culprits in play are bash running a straightforward
> shellscript line with redirected output to a simple file on a non-full
> filesystem, I'm leaning more towards the odds that something related
> to the sshd + tcp/ip + ssh client portion of things went crazy.

Yeah, could be.  Anyway it doesn't seem like we can learn much more
today.  You might as well just zing the vacuumdb process and let
things get back to normal.  If it happens again, we'd have reason
to dig deeper.

			regards, tom lane
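The "zing the vacuumdb process" step as a sketch: killing the client end is enough, since the backend's blocked send() then fails and the backend aborts its VACUUM on its own. The PID is the vacuumdb frontend from the earlier ps listing; the stuck bash/ssh pair on the backup box can be cleaned up the same way once their PIDs are known.

    # On the primary box: kill the stuck vacuumdb frontend.
    kill 19236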
On Nov 28, 2005, at 1:46 PM, Tom Lane wrote:
> James Robinson <jlrobins@socialserve.com> writes:
>> backtrace of the sshd doesn't look good:
>
> Stripped executable :-( ... you won't get much info there.  What of
> the client at the far end of the ssh connection?  You should probably
> assume that the blockage is there, rather than in a commonly used bit
> of software like ssh.

OK: cron fired off a bash running our script which performs the backup + vacuuming on the backup box side, and that script was at the point of driving vacuumdb:

    sscadmin 20612 20610  0 Nov25 ?  00:00:00 bash /usr/local/bin/db_backup.sh
    sscadmin 20622 20612  0 Nov25 ?  00:00:00 ssh -c blowfish [ ssh identity file + host edited out ] /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose

[ yes, verbose vacuum. Who knew that'd be the camel-breaking straw??! ]

The line in the script invoking the ssh'd vacuumdb is:

    # Vacuum all databases, storing log results.
    $SSHCMD $DBHOST /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose >& $DATE/vacuum.log

Unfortunately the dir holding that date + hour's vacuum.log was swept away by the next day's activities.

The stuck bash is backtraceable:

    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbffff928 in ?? ()
    #2  0x00000000 in ?? ()
    #3  0xbffff918 in ?? ()
    #4  0xb7ed1513 in __waitpid_nocancel () from /lib/tls/libc.so.6
    #5  0x080935bf in default_tty_job_signals ()
    #6  0x080949ca in wait_for ()
    #7  0x0808acd7 in execute_command_internal ()
    #8  0x0808a4f0 in execute_command ()
    #9  0x0808241d in reader_loop ()
    #10 0x08081364 in main ()
    (gdb)

Nothing unexpected there. Funny though, file reports /bin/bash as being stripped (as it does for /usr/bin/ssh and /usr/sbin/sshd), yet I could get far better debugging support from it. Could the debugging issue be ssh / sshd's apparent multithreadedness?

    gdb /usr/bin/ssh
    GNU gdb 6.2.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i586-suse-linux"...(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".
    (gdb) run localhost
    Starting program: /usr/bin/ssh localhost
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    [Thread debugging using libthread_db enabled]
    [New Thread 1078408704 (LWP 29932)]
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    (no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
    The authenticity of host 'localhost (127.0.0.1)' can't be established.
    RSA key fingerprint is f4:cd:bc:37:d7:08:bc:4f:04:91:45:9b:44:cf:d5:b9.
    Are you sure you want to continue connecting (yes/no)?

    Program received signal SIGINT, Interrupt.
    [Switching to Thread 1078408704 (LWP 29932)]
    0xffffe410 in ?? ()
    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0xbfffb9e8 in ?? ()
    #2  0x00000001 in ?? ()
    #3  0xbfffb3af in ?? ()
    #4  0x402f94b3 in __read_nocancel () from /lib/tls/libc.so.6
    #5  0x080742e1 in mkstemp64 ()
    #6  0x080684c8 in error ()
    #7  0x08054e91 in ?? ()
    #8  0xbfffcac0 in ?? ()
    #9  0x00000001 in ?? ()
    #10 0x08096230 in ?? ()
    #11 0xbfffcac0 in ?? ()
    ....

I know you have far better things to do than teach someone how to use gdb on multithreaded programs, but could a proper backtrace be salvageable on the ssh client? If you really care, that is; otherwise I'm off to kill that vacuumdb client.

>> At this moment in time, should we kill off the offending processes
>> from Nov 25, starting from the client-most side all the way to the
>> vacuumdb process on the production server? The other vacuums would
>> probably then complete happily, and we'd be cool again, eh?
>
> If you just want to get out of it, killing the vacuumdb should be the
> least dangerous way to get out of the problem.  I'd suggest taking a
> little more time to try to find out what's stuck though.

Given the other culprits in play are bash running a straightforward shellscript line with redirected output to a simple file on a non-full filesystem, I'm leaning more towards the odds that something related to the sshd + tcp/ip + ssh client portion of things went crazy. It just seems that's where more of the complexity is -- bash is doing nothing but waiting for the ssh client to finish, and its stdout / stderr were redirected to the filesystem before the bash child exec'd ssh. Even if the filesystem became full somehow during that run, one would expect the ssh client's writes to the file to error out as opposed to blocking. Funky.

>> I suppose we're darn lucky the process got ultimately gummed up on a
>> table that sees no traffic at all, eh? The lock that vacuum has taken
>> out on it would prevent at least some things happening to the table in
>> question -- possibly even new inserts or updates?
>
> No, since it's a lazy vacuum it's not going to block either reads or
> writes.  Just other vacuums and DDL changes on the table.

Good to know.

>> Could this be alleviated in the future by a little code reordering in
>> vacuumdb or the postmaster -- finishing work on the current table
>> completely before emitting output,
>
> Wouldn't help.  I would imagine that by the time we got to this state,
> the backend side is quite a few tables past the point where the end
> client stopped accepting output.  You'd normally expect there to be
> several Kb worth of buffering in the network channel, and the backend
> isn't going to be blocked till that's *all* used up.

But when it does get blocked up, albeit after the fact, it wouldn't be holding any postgres-level locks. Then again, since this was a lazy vacuum, only other lazy vacuums or DDL would be blocked; had some other poor chap been running a full vacuum, it could have been worse.

> BTW, I suppose this was a VACUUM VERBOSE?  Without the verbose output,
> you'd likely not have seen any problem ...

Yep -- see above. I do believe we've agreed internally to separate out the vacuuming bit from the backup bit, and have it driven by a separate cron'd script on the primary database box itself, to simplify the system for this case. And with 8.1 we'll move to the autovacuum service, so as to obliterate the need altogether. So our vacuuming would happen in a globally far simpler system -- no remote-host networking involved at all.

Again, thanks for all of your time and superb support.

----
James Robinson
Socialserve.com
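The "separate, local" arrangement described above, as a minimal sketch (the log directory and filenames are hypothetical): the vacuum is driven by cron on the database host itself and logs locally, so no ssh channel ever sits between the backend and its output.

    #!/bin/sh
    # Hourly, cron-driven on the primary database box; hypothetical paths.
    DATE=`date +%Y%m%d%H`
    LOGDIR=/var/log/pg_maintenance/$DATE
    mkdir -p $LOGDIR
    /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose \
        > $LOGDIR/vacuum.log 2>&1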
On Nov 28, 2005, at 4:13 PM, Tom Lane wrote:
> Yeah, could be.  Anyway it doesn't seem like we can learn much more
> today.  You might as well just zing the vacuumdb process and let
> things get back to normal.  If it happens again, we'd have reason
> to dig deeper.

Final report [ and apologies to the hackers list in general -- sorry for the noise today ].

Killed the vacuumdb frontend, then went off killing the processes spawned by cron on Nov 25 related to the cronjob. All of the related backends exited peacefully, and all is well. A manual vacuum verbose analyze completes successfully.

One possibly curious thing -- one final process remains on the backup box dated Nov 25:

    root 19912 3 0 Nov25 ?  00:00:12 [pdflush]

Coincidence? This is some sort of kernel thread, right? Flushes dirty pages to disk? There are two on this machine:

    root  9211 3 0 Nov22 ?  00:02:56 [pdflush]
    root 19912 3 0 Nov25 ?  00:00:12 [pdflush]

The Nov 25th pdflush's pid is suspiciously close to the pids which would have been in use around the beginning of the cron'd process. [ checks /var/log/messages ... ] -- yep, real close -- the last known cross-referencable pid is:

    Nov 25 04:59:01 db02 /usr/sbin/cron[20590]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly)

and the vacuumdb sshd connection on the production db box is logged at 05:02:22 AM, so that pdflush would have been started real close to the time at which the remote backup + vacuum script was running.

Any Linux 2.6 gurus lurking? Under what circumstances do pdflushes get spawned? The filesystem the outputs were going to is a software RAID partition (raid-0? raid-1? I always confuse the two -- the interleaved one anyway, not mirrored), formatted reiser3. Neither pdflush instance on this machine was started anywhere near the boot time of the machine -- both much later. Whereas on the production box the two pdflush instances are both dated from machine boot time. Does this perchance indicate unhappiness afoot, perhaps hardware-wise?

----
James Robinson
Socialserve.com