Thread: weird buildfarm failures on arm/mipsel and --with-tcl
one of my new buildfarm boxes (an Debian/Etch based ARM box) is sometimes failing to stop the database during the regression tests: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 this only seems to happen sometimes and only if --with-tcl is enabled on quagga. lionfish (my mipsel box) is able to trigger that on every build if I enable --with-tcl but it is nearly impossible to debug it there because of the low amount of memory and diskspace it has. (two consecutive failures will run the kernel out of memory due to the resources consumed by the still running processes). After the stopdb failure we still have those processes running: pgbuild 3389 0.0 1.5 39632 4112 ? S 06:14 0:03 /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data pgbuild 3391 0.0 0.9 39632 2540 ? Ss 06:14 0:00 postgres: writer process pgbuild 3392 0.0 0.5 11220 1348 ? Ss 06:14 0:00 postgres: stats collector process pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01 postgres: pgbuild pl_regression [local] idle pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00 [postgres] <defunct> Any ideas on how to debug that any further ? Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > one of my new buildfarm boxes (an Debian/Etch based ARM box) is > sometimes failing to stop the database during the regression tests: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 > this only seems to happen sometimes and only if --with-tcl is enabled on > quagga. > lionfish (my mipsel box) is able to trigger that on every build if I > enable --with-tcl but it is nearly impossible to debug it there because > of the low amount of memory and diskspace it has. Hm, could pl/tcl somehow be preventing the backend from exiting once it's run any pl/tcl stuff? I have no idea why though, and even less why it wouldn't be repeatable. > After the stopdb failure we still have those processes running: > pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01 > postgres: pgbuild pl_regression [local] idle Can you get a stack trace from this process? > pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00 > [postgres] <defunct> This is a bit odd ... if that process is a direct child of the postmaster it should have been reaped promptly. Could it be a child of the other backend? If so, why was it started? Please try the ps again with whatever switch it needs to list parent process ID. regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> one of my new buildfarm boxes (an Debian/Etch based ARM box) is >> sometimes failing to stop the database during the regression tests: > >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 > >> this only seems to happen sometimes and only if --with-tcl is enabled on >> quagga. > >> lionfish (my mipsel box) is able to trigger that on every build if I >> enable --with-tcl but it is nearly impossible to debug it there because >> of the low amount of memory and diskspace it has. > > Hm, could pl/tcl somehow be preventing the backend from exiting once > it's run any pl/tcl stuff? I have no idea why though, and even less > why it wouldn't be repeatable. > >> After the stopdb failure we still have those processes running: >> pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01 >> postgres: pgbuild pl_regression [local] idle > > Can you get a stack trace from this process? (gdb) bt #0 0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0 #1 0x406b8a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0 #3 0x40438658 in exit () from /lib/libc.so.6 #4 0x40438658 in exit () from /lib/libc.so.6 Previous frame identical to this frame (corrupt stack?) > >> pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00 >> [postgres] <defunct> > > This is a bit odd ... if that process is a direct child of the > postmaster it should have been reaped promptly. Could it be a child > of the other backend? If so, why was it started? Please try the > ps again with whatever switch it needs to list parent process ID. looks you are right - the defunct 3489 seems to be a child of 3488: PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND 1 3389 18341 18341 ? -1 S 1001 0:03 /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data3389 3391 3391 3391 ? -1 Ss 1001 0:00 postgres: writer process3389 3392 3392 3392 ? -1 Ss 1001 0:00 postgres: stats collector process3389 3488 3488 3488 ? -1 Ss 1001 0:01 postgres: pgbuild pl_regression [local] idle3488 3489 3488 3488 ? -1 Z 1001 0:00 [postgres] <defunct> Stefan
Stefan Kaltenbrunner wrote: > Tom Lane wrote: >> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >>> one of my new buildfarm boxes (an Debian/Etch based ARM box) is >>> sometimes failing to stop the database during the regression tests: >>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03 >>> this only seems to happen sometimes and only if --with-tcl is enabled on >>> quagga. >>> lionfish (my mipsel box) is able to trigger that on every build if I >>> enable --with-tcl but it is nearly impossible to debug it there because >>> of the low amount of memory and diskspace it has. >> Hm, could pl/tcl somehow be preventing the backend from exiting once >> it's run any pl/tcl stuff? I have no idea why though, and even less >> why it wouldn't be repeatable. >> >>> After the stopdb failure we still have those processes running: >>> pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01 >>> postgres: pgbuild pl_regression [local] idle >> Can you get a stack trace from this process? > > (gdb) bt > #0 0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0 > #1 0x406b8a7c in __pthread_wait_for_restart_signal () from > /lib/libpthread.so.0 > #2 0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0 > #3 0x40438658 in exit () from /lib/libc.so.6 > #4 0x40438658 in exit () from /lib/libc.so.6 > Previous frame identical to this frame (corrupt stack?) > > > >>> pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00 >>> [postgres] <defunct> >> This is a bit odd ... if that process is a direct child of the >> postmaster it should have been reaped promptly. Could it be a child >> of the other backend? If so, why was it started? Please try the >> ps again with whatever switch it needs to list parent process ID. > > looks you are right - the defunct 3489 seems to be a child of 3488: > > PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND > 1 3389 18341 18341 ? -1 S 1001 0:03 > /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data > 3389 3391 3391 3391 ? -1 Ss 1001 0:00 postgres: > writer process > 3389 3392 3392 3392 ? -1 Ss 1001 0:00 postgres: stats > collector process > 3389 3488 3488 3488 ? -1 Ss 1001 0:01 postgres: > pgbuild pl_regression [local] idle > 3488 3489 3488 3488 ? -1 Z 1001 0:00 [postgres] > <defunct> FWIW - I removed --with-tcl from quagga's configuration about two weeks ago and it has not failed(for that reason) again. So the issue most definitly looks like plptcl related ... Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > FWIW - I removed --with-tcl from quagga's configuration about two weeks > ago and it has not failed(for that reason) again. So the issue most > definitly looks like plptcl related ... It sorta looks like Tcl might be installing an atexit() callback that is doing the Wrong Thing somehow. What Tcl version do you have installed exactly, and with what configure options? (The contents of tclConfig.sh should be reasonably complete info.) regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> FWIW - I removed --with-tcl from quagga's configuration about two weeks >> ago and it has not failed(for that reason) again. So the issue most >> definitly looks like plptcl related ... > > It sorta looks like Tcl might be installing an atexit() callback that is > doing the Wrong Thing somehow. What Tcl version do you have installed > exactly, and with what configure options? (The contents of tclConfig.sh > should be reasonably complete info.) version is 8.4.12-1.1(current debian etch package) and tclConfig.sh is at: http://www.kaltenbrunner.cc/files/tclConfig.sh.txt Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > FWIW - I removed --with-tcl from quagga's configuration about two weeks > ago and it has not failed(for that reason) again. So the issue most > definitly looks like plptcl related ... I poked around a bit in the Tcl sources, and as far as I can tell it will never call atexit() on a Linux platform. So that theory isn't right. However, what I found out via strace'ing is that on my own Linux machine, Tcl launches a second thread as soon as it's loaded. This thread just does a select() and sits there. I am not certain but I think it must be the "notifier" facility seen in the Tcl sources. What seems likely to me is that the "child process" we see in your ps output is actually this notifier thread (can you confirm that it's a thread via additional ps arguments?), and that for some reason the shutdown of the notifier isn't working right, or maybe the notifier is dying for some reason and then the main thread can't exit because it expects the notifier to respond. Can you "strace -f" the Tcl test and see what happens? The way I did it was createdb pl_regressioncreatelang pltcl pl_regressionpsql pl_regression -- in another window, identify PID of connected backend, thenstrace -f -p PID 2>strace.out psql> \i .../pltcl_setup.sqlpsql> \i .../pltcl_queries.sqlpsql> \q One possibility for fixing it is that maybe we should be making an effort to execute Tcl_Finalize() before exiting the backend. If so, having pltcl set up an on_proc_exit callback to do it would be the appropriate thing. This is all speculation though. regards, tom lane
I wrote: > One possibility for fixing it is that maybe we should be making an > effort to execute Tcl_Finalize() before exiting the backend. If so, > having pltcl set up an on_proc_exit callback to do it would be the > appropriate thing. This is all speculation though. Just for grins I tried this, and I can see by strace'ing that it changes the process-shutdown-time behavior quite a lot: the secondary thread now exits first, apparently after being told to via a message from the primary. So I think this might indeed be something good to do. Would you try the attached patch and see if it changes the behavior on your systems? (This patch is very ugly and will draw compiler warnings, but don't worry about that yet.) regards, tom lane Index: pltcl.c =================================================================== RCS file: /cvsroot/pgsql/src/pl/tcl/pltcl.c,v retrieving revision 1.108 diff -c -r1.108 pltcl.c *** pltcl.c 4 Oct 2006 00:30:14 -0000 1.108 --- pltcl.c 28 Jan 2007 21:45:40 -0000 *************** *** 26,31 **** --- 26,32 ---- #include "fmgr.h" #include "nodes/makefuncs.h" #include "parser/parse_type.h" + #include "storage/ipc.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" #include "utils/lsyscache.h" *************** *** 245,250 **** --- 246,253 ---- ************************************************************/ if (!pltcl_be_init_done) { + on_proc_exit(Tcl_Finalize, 0); + if (SPI_connect() != SPI_OK_CONNECT) elog(ERROR, "SPI_connect failed"); pltcl_init_load_unknown(pltcl_norm_interp);
Tom Lane wrote: > I wrote: >> One possibility for fixing it is that maybe we should be making an >> effort to execute Tcl_Finalize() before exiting the backend. If so, >> having pltcl set up an on_proc_exit callback to do it would be the >> appropriate thing. This is all speculation though. > > Just for grins I tried this, and I can see by strace'ing that it changes > the process-shutdown-time behavior quite a lot: the secondary thread now > exits first, apparently after being told to via a message from the > primary. So I think this might indeed be something good to do. Would > you try the attached patch and see if it changes the behavior on your > systems? (This patch is very ugly and will draw compiler warnings, but > don't worry about that yet.) this patch definitly changes behaviour but not actually for the better :-( after running the tcl regression tests and exiting psql I'm left with two(!) backends(and I'm unable to stop the postmaster short of using -m immedidate): UID PID PPID LWP C NLWP STIME TTY TIME CMD 1000 7191 1 7191 0 1 19:02 pts/0 00:00:03 /home/mastermind/pginst/bin/postgres -D /home/mastermind/data 1000 7202 7191 7202 0 1 19:02 ? 00:00:00 postgres: writer process 1000 7203 7191 7203 0 1 19:02 ? 00:00:00 postgres: stats collector process 1000 7235 7191 7235 0 1 19:06 ? 00:00:01 postgres: mastermind pl_regression [local] idle 1000 7267 7235 7267 0 1 19:08 ? 00:00:00 postgres: mastermind pl_regression [local] idle tracefile for 7235: http://www.kaltenbrunner.cc/files/strace2.out backtrace for 7235: (gdb) bt #0 0x41f04d80 in __pthread_sigsuspend () from /lib/libpthread.so.0 #1 0x41f03a7c in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x41f05d18 in pthread_key_delete () from /lib/libpthread.so.0 #3 0x41edbe1c in TclpFinalizeThreadDataKey () from /usr/lib/libtcl8.4.so.0 #4 0x41ec96dc in TclFinalizeSynchronization () from /usr/lib/libtcl8.4.so.0 #5 0x41e92040 in Tcl_Finalize () from /usr/lib/libtcl8.4.so.0 #6 0x001ab204 in proc_exit (code=0) at ipc.c:109 #7 0x001be3a8 in PostgresMain (argc=1074783216, argv=<value optimized out>, username=0x0) at postgres.c:3638 #8 0x0018fb98 in ServerLoop () at postmaster.c:2953 #9 0x00190898 in PostmasterMain (argc=3, argv=0x36e5c0) at postmaster.c:963 #10 0x001468b0 in main (argc=3, argv=<value optimized out>) at main.c:188 Stefan
Stefan Kaltenbrunner wrote: > backtrace for 7235: > > (gdb) bt Please do this in GDB: thread apply all bt (or maybe it is threads apply all bt) This'll give you backtraces for all threads in the process. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > this patch definitly changes behaviour but not actually for the better :-( Oh well, it was worth a try. At this point I think we have to suppose this is a Tcl bug and not our fault. Can you reproduce the problem in bare "tclsh"? Try $ tclsh% interp createinterp0% interp create -safeinterp1% exit$ If tclsh doesn't quit when told then it's easy to file (but you might want to try the latest tcl version first --- they're up to 8.4.14) regards, tom lane
Alvaro Herrera wrote: > Stefan Kaltenbrunner wrote: > >> backtrace for 7235: >> >> (gdb) bt > > Please do this in GDB: > > thread apply all bt > > (or maybe it is > threads apply all bt) > > This'll give you backtraces for all threads in the process. sorry forgot to mention that - the backtrace for the other one is an endless loop of: Thread 2 (Thread 32769 (LWP 7267)): #0 0x4018062c in poll () from /lib/libc.so.6 #1 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #2 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #3 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #4 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #5 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #6 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #7 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #8 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #9 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #10 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #11 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 #12 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0 Stefan
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> this patch definitly changes behaviour but not actually for the better :-( > > Oh well, it was worth a try. At this point I think we have to suppose > this is a Tcl bug and not our fault. Can you reproduce the problem in > bare "tclsh"? Try > > $ tclsh > % interp create > interp0 > % interp create -safe > interp1 > % exit > $ > > If tclsh doesn't quit when told then it's easy to file (but you might > want to try the latest tcl version first --- they're up to 8.4.14) hmm the above sequence just works fine on that box - will experiment with a more recent version as time permits (there is no other version available in the debian repo right now which makes this a bit more difficult). Stefan