Thread: weird buildfarm failures on arm/mipsel and --with-tcl

weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

08 January 2007, 15:39:07

one of my new buildfarm boxes (an Debian/Etch based ARM box) is
sometimes failing to stop the database during the regression tests:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03

this only seems to happen sometimes and only if --with-tcl is enabled on
quagga.

lionfish (my mipsel box) is able to trigger that on every build if I
enable --with-tcl but it is nearly impossible to debug it there because
of the low amount of memory and diskspace it has. (two consecutive
failures will run the kernel out of memory due to the resources consumed
by the still running processes).

After the stopdb failure we still have those processes running:

pgbuild   3389  0.0  1.5  39632  4112 ?        S    06:14   0:03
/home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
pgbuild   3391  0.0  0.9  39632  2540 ?        Ss   06:14   0:00
postgres: writer process
pgbuild   3392  0.0  0.5  11220  1348 ?        Ss   06:14   0:00
postgres: stats collector process
pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
postgres: pgbuild pl_regression [local] idle
pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
[postgres] <defunct>


Any ideas on how to debug that any further ?


Stefan

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Tom Lane

Date:

08 January 2007, 22:52:26

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> one of my new buildfarm boxes (an Debian/Etch based ARM box) is
> sometimes failing to stop the database during the regression tests:

> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03

> this only seems to happen sometimes and only if --with-tcl is enabled on
> quagga.

> lionfish (my mipsel box) is able to trigger that on every build if I
> enable --with-tcl but it is nearly impossible to debug it there because
> of the low amount of memory and diskspace it has.

Hm, could pl/tcl somehow be preventing the backend from exiting once
it's run any pl/tcl stuff?  I have no idea why though, and even less
why it wouldn't be repeatable. 

> After the stopdb failure we still have those processes running:
> pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
> postgres: pgbuild pl_regression [local] idle

Can you get a stack trace from this process?

> pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
> [postgres] <defunct>

This is a bit odd ... if that process is a direct child of the
postmaster it should have been reaped promptly.  Could it be a child
of the other backend?  If so, why was it started?  Please try the
ps again with whatever switch it needs to list parent process ID.
        regards, tom lane

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

09 January 2007, 03:24:41

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> one of my new buildfarm boxes (an Debian/Etch based ARM box) is
>> sometimes failing to stop the database during the regression tests:
> 
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
> 
>> this only seems to happen sometimes and only if --with-tcl is enabled on
>> quagga.
> 
>> lionfish (my mipsel box) is able to trigger that on every build if I
>> enable --with-tcl but it is nearly impossible to debug it there because
>> of the low amount of memory and diskspace it has.
> 
> Hm, could pl/tcl somehow be preventing the backend from exiting once
> it's run any pl/tcl stuff?  I have no idea why though, and even less
> why it wouldn't be repeatable. 
> 
>> After the stopdb failure we still have those processes running:
>> pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
>> postgres: pgbuild pl_regression [local] idle
> 
> Can you get a stack trace from this process?

(gdb) bt
#0  0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
#1  0x406b8a7c in __pthread_wait_for_restart_signal () from
/lib/libpthread.so.0
#2  0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0
#3  0x40438658 in exit () from /lib/libc.so.6
#4  0x40438658 in exit () from /lib/libc.so.6
Previous frame identical to this frame (corrupt stack?)



> 
>> pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
>> [postgres] <defunct>
> 
> This is a bit odd ... if that process is a direct child of the
> postmaster it should have been reaped promptly.  Could it be a child
> of the other backend?  If so, why was it started?  Please try the
> ps again with whatever switch it needs to list parent process ID.

looks you are right - the defunct 3489 seems to be a child of 3488:
PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND   1  3389 18341 18341 ?           -1 S     1001   0:03
/home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data3389  3391  3391  3391 ?           -1 Ss    1001   0:00
postgres:
writer process3389  3392  3392  3392 ?           -1 Ss    1001   0:00 postgres: stats
collector process3389  3488  3488  3488 ?           -1 Ss    1001   0:01 postgres:
pgbuild pl_regression [local] idle3488  3489  3488  3488 ?           -1 Z     1001   0:00 [postgres]
<defunct>


Stefan

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

24 January 2007, 14:36:06

Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> one of my new buildfarm boxes (an Debian/Etch based ARM box) is
>>> sometimes failing to stop the database during the regression tests:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
>>> this only seems to happen sometimes and only if --with-tcl is enabled on
>>> quagga.
>>> lionfish (my mipsel box) is able to trigger that on every build if I
>>> enable --with-tcl but it is nearly impossible to debug it there because
>>> of the low amount of memory and diskspace it has.
>> Hm, could pl/tcl somehow be preventing the backend from exiting once
>> it's run any pl/tcl stuff?  I have no idea why though, and even less
>> why it wouldn't be repeatable. 
>>
>>> After the stopdb failure we still have those processes running:
>>> pgbuild   3488  0.0  2.4  43640  6300 ?        Ss   06:15   0:01
>>> postgres: pgbuild pl_regression [local] idle
>> Can you get a stack trace from this process?
> 
> (gdb) bt
> #0  0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
> #1  0x406b8a7c in __pthread_wait_for_restart_signal () from
> /lib/libpthread.so.0
> #2  0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0
> #3  0x40438658 in exit () from /lib/libc.so.6
> #4  0x40438658 in exit () from /lib/libc.so.6
> Previous frame identical to this frame (corrupt stack?)
> 
> 
> 
>>> pgbuild   3489  0.0  0.0      0     0 ?        Z    06:15   0:00
>>> [postgres] <defunct>
>> This is a bit odd ... if that process is a direct child of the
>> postmaster it should have been reaped promptly.  Could it be a child
>> of the other backend?  If so, why was it started?  Please try the
>> ps again with whatever switch it needs to list parent process ID.
> 
> looks you are right - the defunct 3489 seems to be a child of 3488:
> 
>  PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
>     1  3389 18341 18341 ?           -1 S     1001   0:03
> /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
>  3389  3391  3391  3391 ?           -1 Ss    1001   0:00 postgres:
> writer process
>  3389  3392  3392  3392 ?           -1 Ss    1001   0:00 postgres: stats
> collector process
>  3389  3488  3488  3488 ?           -1 Ss    1001   0:01 postgres:
> pgbuild pl_regression [local] idle
>  3488  3489  3488  3488 ?           -1 Z     1001   0:00 [postgres]
> <defunct>

FWIW - I removed --with-tcl from quagga's configuration about two weeks
ago and it has not failed(for that reason) again. So the issue most
definitly looks like plptcl related ...


Stefan

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Tom Lane

Date:

24 January 2007, 14:54:09

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FWIW - I removed --with-tcl from quagga's configuration about two weeks
> ago and it has not failed(for that reason) again. So the issue most
> definitly looks like plptcl related ...

It sorta looks like Tcl might be installing an atexit() callback that is
doing the Wrong Thing somehow.  What Tcl version do you have installed
exactly, and with what configure options?  (The contents of tclConfig.sh
should be reasonably complete info.)
        regards, tom lane

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

24 January 2007, 15:24:51

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> FWIW - I removed --with-tcl from quagga's configuration about two weeks
>> ago and it has not failed(for that reason) again. So the issue most
>> definitly looks like plptcl related ...
> 
> It sorta looks like Tcl might be installing an atexit() callback that is
> doing the Wrong Thing somehow.  What Tcl version do you have installed
> exactly, and with what configure options?  (The contents of tclConfig.sh
> should be reasonably complete info.)

version is 8.4.12-1.1(current debian etch package) and tclConfig.sh is at:

http://www.kaltenbrunner.cc/files/tclConfig.sh.txt


Stefan

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Tom Lane

Date:

28 January 2007, 17:07:35

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FWIW - I removed --with-tcl from quagga's configuration about two weeks
> ago and it has not failed(for that reason) again. So the issue most
> definitly looks like plptcl related ...

I poked around a bit in the Tcl sources, and as far as I can tell it
will never call atexit() on a Linux platform.  So that theory isn't
right.  However, what I found out via strace'ing is that on my own
Linux machine, Tcl launches a second thread as soon as it's loaded.
This thread just does a select() and sits there.  I am not certain
but I think it must be the "notifier" facility seen in the Tcl sources.

What seems likely to me is that the "child process" we see in your ps
output is actually this notifier thread (can you confirm that it's a
thread via additional ps arguments?), and that for some reason the
shutdown of the notifier isn't working right, or maybe the notifier
is dying for some reason and then the main thread can't exit because
it expects the notifier to respond.

Can you "strace -f" the Tcl test and see what happens?  The
way I did it was
createdb pl_regressioncreatelang pltcl pl_regressionpsql pl_regression

-- in another window, identify PID of connected backend, thenstrace -f -p PID 2>strace.out
psql> \i .../pltcl_setup.sqlpsql> \i .../pltcl_queries.sqlpsql> \q

One possibility for fixing it is that maybe we should be making an
effort to execute Tcl_Finalize() before exiting the backend.  If so,
having pltcl set up an on_proc_exit callback to do it would be the
appropriate thing.  This is all speculation though.
        regards, tom lane

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Tom Lane

Date:

28 January 2007, 17:50:00

I wrote:
> One possibility for fixing it is that maybe we should be making an
> effort to execute Tcl_Finalize() before exiting the backend.  If so,
> having pltcl set up an on_proc_exit callback to do it would be the
> appropriate thing.  This is all speculation though.

Just for grins I tried this, and I can see by strace'ing that it changes
the process-shutdown-time behavior quite a lot: the secondary thread now
exits first, apparently after being told to via a message from the
primary.  So I think this might indeed be something good to do.  Would
you try the attached patch and see if it changes the behavior on your
systems?  (This patch is very ugly and will draw compiler warnings, but
don't worry about that yet.)
        regards, tom lane

Index: pltcl.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/tcl/pltcl.c,v
retrieving revision 1.108
diff -c -r1.108 pltcl.c
*** pltcl.c    4 Oct 2006 00:30:14 -0000    1.108
--- pltcl.c    28 Jan 2007 21:45:40 -0000
***************
*** 26,31 ****
--- 26,32 ---- #include "fmgr.h" #include "nodes/makefuncs.h" #include "parser/parse_type.h"
+ #include "storage/ipc.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" #include "utils/lsyscache.h"
***************
*** 245,250 ****
--- 246,253 ----      ************************************************************/     if (!pltcl_be_init_done)     {
+             on_proc_exit(Tcl_Finalize, 0);
+          if (SPI_connect() != SPI_OK_CONNECT)             elog(ERROR, "SPI_connect failed");
pltcl_init_load_unknown(pltcl_norm_interp);

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

29 January 2007, 14:23:10

Tom Lane wrote:
> I wrote:
>> One possibility for fixing it is that maybe we should be making an
>> effort to execute Tcl_Finalize() before exiting the backend.  If so,
>> having pltcl set up an on_proc_exit callback to do it would be the
>> appropriate thing.  This is all speculation though.
> 
> Just for grins I tried this, and I can see by strace'ing that it changes
> the process-shutdown-time behavior quite a lot: the secondary thread now
> exits first, apparently after being told to via a message from the
> primary.  So I think this might indeed be something good to do.  Would
> you try the attached patch and see if it changes the behavior on your
> systems?  (This patch is very ugly and will draw compiler warnings, but
> don't worry about that yet.)

this patch definitly changes behaviour but not actually for the better :-(

after running the tcl regression tests and exiting psql I'm left with
two(!) backends(and I'm unable to stop the postmaster short of using -m
immedidate):


UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
1000      7191     1  7191  0    1 19:02 pts/0    00:00:03
/home/mastermind/pginst/bin/postgres -D /home/mastermind/data
1000      7202  7191  7202  0    1 19:02 ?        00:00:00 postgres:
writer process
1000      7203  7191  7203  0    1 19:02 ?        00:00:00 postgres:
stats collector process
1000      7235  7191  7235  0    1 19:06 ?        00:00:01 postgres:
mastermind pl_regression [local] idle
1000      7267  7235  7267  0    1 19:08 ?        00:00:00 postgres:
mastermind pl_regression [local] idle

tracefile for 7235:

http://www.kaltenbrunner.cc/files/strace2.out

backtrace for 7235:

(gdb) bt
#0  0x41f04d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
#1  0x41f03a7c in __pthread_wait_for_restart_signal () from
/lib/libpthread.so.0
#2  0x41f05d18 in pthread_key_delete () from /lib/libpthread.so.0
#3  0x41edbe1c in TclpFinalizeThreadDataKey () from /usr/lib/libtcl8.4.so.0
#4  0x41ec96dc in TclFinalizeSynchronization () from /usr/lib/libtcl8.4.so.0
#5  0x41e92040 in Tcl_Finalize () from /usr/lib/libtcl8.4.so.0
#6  0x001ab204 in proc_exit (code=0) at ipc.c:109
#7  0x001be3a8 in PostgresMain (argc=1074783216, argv=<value optimized
out>, username=0x0) at postgres.c:3638
#8  0x0018fb98 in ServerLoop () at postmaster.c:2953
#9  0x00190898 in PostmasterMain (argc=3, argv=0x36e5c0) at postmaster.c:963
#10 0x001468b0 in main (argc=3, argv=<value optimized out>) at main.c:188


Stefan

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Alvaro Herrera

Date:

29 January 2007, 14:54:49

Stefan Kaltenbrunner wrote:

> backtrace for 7235:
> 
> (gdb) bt

Please do this in GDB:

thread apply all bt

(or maybe it is
threads apply all bt)

This'll give you backtraces for all threads in the process.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Tom Lane

Date:

29 January 2007, 15:00:32

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> this patch definitly changes behaviour but not actually for the better :-(

Oh well, it was worth a try.  At this point I think we have to suppose
this is a Tcl bug and not our fault.  Can you reproduce the problem in
bare "tclsh"?  Try
$ tclsh% interp createinterp0% interp create -safeinterp1% exit$

If tclsh doesn't quit when told then it's easy to file (but you might
want to try the latest tcl version first --- they're up to 8.4.14)
        regards, tom lane

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

29 January 2007, 15:00:43

Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
> 
>> backtrace for 7235:
>>
>> (gdb) bt
> 
> Please do this in GDB:
> 
> thread apply all bt
> 
> (or maybe it is
> threads apply all bt)
> 
> This'll give you backtraces for all threads in the process.

sorry forgot to mention that - the backtrace for the other one is an
endless loop of:

Thread 2 (Thread 32769 (LWP 7267)):
#0  0x4018062c in poll () from /lib/libc.so.6
#1  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#2  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#3  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#4  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#5  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#6  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#7  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#8  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#9  0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#10 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#11 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
#12 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0


Stefan

Re: weird buildfarm failures on arm/mipsel and --with-tcl

From

Stefan Kaltenbrunner

Date:

29 January 2007, 15:14:55

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> this patch definitly changes behaviour but not actually for the better :-(
> 
> Oh well, it was worth a try.  At this point I think we have to suppose
> this is a Tcl bug and not our fault.  Can you reproduce the problem in
> bare "tclsh"?  Try
> 
>     $ tclsh
>     % interp create
>     interp0
>     % interp create -safe
>     interp1
>     % exit
>     $
> 
> If tclsh doesn't quit when told then it's easy to file (but you might
> want to try the latest tcl version first --- they're up to 8.4.14)

hmm the above sequence just works fine on that box - will experiment
with a more recent version as time permits (there is no other version
available in the debian repo right now which makes this a bit more
difficult).

Stefan