Thread: Buildfarm alarms
Hi Andrew, I'm just investigating a problem with beta 1 running on Windows 2K and XP, and noticed that neither Snake nor Bandicoot has built -HEAD for nearly 3 weeks. I'm investigating why and will fix the problem, but it strikes me that an alarm email from the server, noting that a run hasn't been reported for a while, would have helped spot this earlier. This could be configured with an admin-specified maximum number of days between reports to allow for those machines that connect far less frequently. Does that sound feasible to you? Regards, Dave.
Dave Page wrote: > > I'm just investigating a problem with beta 1 running on Windows 2K and > XP, and noticed that neither Snake nor Bandicoot has built -HEAD for > nearly 3 weeks. I'm investigating why and will fix the problem, but it > strikes me that an alarm email from the server, noting that a run > hasn't been reported for a while, would have helped spot this earlier. > This could be configured with an admin-specified > maximum number of days between reports to allow for those machines that > connect far less frequently. > > Does that sound feasible to you? > > It could certainly be done. In general, I have taken the view that owners have the responsibility for monitoring their own machines. I'll think about it some more. cheers andrew
"Andrew Dunstan" <andrew@dunslane.net> writes: > It could certainly be done. In general, I have generally taken the view > that owners have the responsibility for monitoring their own machines. Sure, but providing them tools to do that seems within buildfarm's purview. For some types of failure, the buildfarm script could make a local notification without bothering the server --- but a timeout on the server side would cover a wider variety of failures, including "this machine is dead and ought to be removed from the farm". regards, tom lane
> -----Original Message----- > From: Andrew Dunstan [mailto:andrew@dunslane.net] > Sent: 24 September 2006 03:13 > To: Dave Page > Cc: pgsql-hackers@postgresql.org > Subject: Re: Buildfarm alarms > > It could certainly be done. In general, I have > taken the view > that owners have the responsibility for monitoring their own machines. > I'll think about it some more. We are monitoring the machine; however, in this case nothing appeared wrong to the monitoring processes - what had happened was that both had hung or got in an infinite loop in ECPG-check, the machine was running just fine, and a glance at the process list showed everything I'd expect to see during a normal run. A system for detecting lack of reports from a member would definitely have helped in this case. Regards, Dave
Tom Lane wrote: > "Andrew Dunstan" <andrew@dunslane.net> writes: >> It could certainly be done. In general, I have taken the view >> that owners have the responsibility for monitoring their own machines. > > Sure, but providing them tools to do that seems within buildfarm's > purview. > > For some types of failure, the buildfarm script could make a local > notification without bothering the server --- but a timeout on the > server side would cover a wider variety of failures, including "this > machine is dead and ought to be removed from the farm". > Nothing gets removed. If a machine does not report on a branch for 30 days it drops off the dashboard, but apart from that it is a retained historic artifact. This buildup in history has been gradually slowing down the dashboard, in fact, but Ian Barwick tells me that he has rewritten my lousy SQL to make it fast again, so we'll soon get that working better. Anyway, I think we can do something fairly simple for these alarms. We'll just have a special stanza in the config file, and a cron job that checks, say, once a day, to see if we have exceeded the alarm period on any machine/branch combination. cheers andrew
On Sun, Sep 24, 2006 at 11:51:49AM +0100, Dave Page wrote: > wrong to the monitoring processes - what had happened was that both had > hung or got in an infinite loop in ECPG-check, the machine was running > just fine Is this still an issue? Can you provide more information? What happens if you run ecpg-check manually? Which test hangs? Joachim -- Joachim Wieland joe@mcknight.de GPG key available
> -----Original Message----- > From: Joachim Wieland [mailto:joe@mcknight.de] > Sent: 25 September 2006 13:25 > To: Dave Page > Cc: Andrew Dunstan; pgsql-hackers@postgresql.org; > meskes@postgresql.org > Subject: Re: [HACKERS] Buildfarm alarms > > On Sun, Sep 24, 2006 at 11:51:49AM +0100, Dave Page wrote: > > wrong to the monitoring processes - what had happened was > that both had > > hung or got in an infinite loop in ECPG-check, the machine > was running > > just fine > > Is this still an issue? Can you provide more information? > What happens if you > run ecpg-check manually? Which test hangs?

Dt_test is the one that hangs - though in actual fact what is happening is that it's crashing and popping up a 'do you wanna debug' dialogue which doesn't get seen in a non-interactive buildfarm run. After saying no to that, the complete list of failed tests is (see Snake/Bandicoot's logs for more info):

testing connect/test1.pgc ... FAILED (log)
testing compat_informix/dec_test.pgc ... FAILED (output)
testing preproc/variable.pgc ... FAILED (log, output)
testing pgtypeslib/dt_test.pgc ... FAILED (log, output)
testing pgtypeslib/num_test.pgc ... FAILED (output)
testing pgtypeslib/num_test2.pgc ... FAILED (output)

Regards, Dave.
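The hang described above (a crashed test waiting on a dialog nobody can see) is the kind of failure a wall-clock limit would turn into a visible result. The following is a minimal sketch in Python, not the buildfarm client's own code (the client is a separate script and nothing in this thread says it does this). The function name `run_with_timeout` and the status strings are made up for illustration.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return (status, returncode).

    status is 'ok', 'failed' or 'timeout' (illustrative labels only).
    On timeout, subprocess.run kills the child before raising.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The step never finished: report it instead of waiting forever.
        return ("timeout", None)
    return ("ok" if proc.returncode == 0 else "failed", proc.returncode)

if __name__ == "__main__":
    # A stand-in for a test step that never returns.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, 1))
```

A limit like this only bounds the direct child process; whether it would also dismiss a debugger dialog on Windows is not something this sketch establishes.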
> -----Original Message----- > From: Michael Meskes [mailto:meskes@postgresql.org] > Sent: 26 September 2006 08:57 > To: Joachim Wieland > Cc: Dave Page; meskes@postgresql.org > Subject: Re: [HACKERS] Buildfarm alarms > > On Mon, Sep 25, 2006 at 09:20:19PM +0200, Joachim Wieland wrote: > > Michael, could you please check and apply? > > Works for me, so I applied it. But then I only tested on Linux. :-)

OK, I now see just one, date format related failure:

============== running regression test queries ==============
/usr/local/src/postgresql-8.2-dev/src/interfaces/ecpg/test/./tmp_check/install//usr/local/pgsql/bin/createuser -R -S -D -q regressuser1
/usr/local/src/postgresql-8.2-dev/src/interfaces/ecpg/test/./tmp_check/install//usr/local/pgsql/bin/createuser -R -S -D -q connectuser
/usr/local/src/postgresql-8.2-dev/src/interfaces/ecpg/test/./tmp_check/install//usr/local/pgsql/bin/createuser -R -S -D -q connectdb
testing connect/test1.pgc ... ok
testing connect/test2.pgc ... ok
testing connect/test3.pgc ... ok
testing connect/test4.pgc ... ok
testing connect/test5.pgc ... ok
testing compat_informix/charfuncs.pgc ... ok
testing compat_informix/dec_test.pgc ... ok
testing compat_informix/rfmtdate.pgc ... ok
testing compat_informix/rfmtlong.pgc ... ok
testing compat_informix/rnull.pgc ... ok
testing compat_informix/test_informix.pgc ... ok
testing compat_informix/test_informix2.pgc ... ok
testing preproc/comment.pgc ... ok
testing preproc/define.pgc ... ok
testing preproc/init.pgc ... ok
testing preproc/type.pgc ... ok
testing preproc/variable.pgc ... FAILED (log, output)
testing preproc/whenever.pgc ... ok
testing pgtypeslib/dt_test.pgc ... ok
testing pgtypeslib/dt_test2.pgc ... ok
testing pgtypeslib/num_test.pgc ... ok
testing pgtypeslib/num_test2.pgc ... ok
testing sql/array.pgc ... ok
testing sql/binary.pgc ... ok
testing sql/code100.pgc ... ok
testing sql/copystdout.pgc ... ok
testing sql/define.pgc ... ok
testing sql/desc.pgc ... ok
testing sql/dynalloc.pgc ... ok
testing sql/dynalloc2.pgc ... ok
testing sql/dyntest.pgc ... ok
testing sql/execute.pgc ... ok
testing sql/fetch.pgc ... ok
testing sql/func.pgc ... ok
testing sql/indicators.pgc ... ok
testing sql/quote.pgc ... ok
testing sql/show.pgc ... ok
testing sql/update.pgc ... ok
testing thread/thread.pgc ... ok
testing thread/thread_implicit.pgc ... ok
============== shutting down postmaster ==============
server stopped
make[1]: *** [check] Error 1
make[1]: Leaving directory `/usr/local/src/postgresql-8.2-dev/src/interfaces/ecpg/test'
make: *** [check] Error 2

*** expected/preproc-variable.stderr Fri Sep 8 10:03:40 2006
--- results/preproc-variable.stderr Tue Sep 26 09:51:00 2006
***************
*** 44,50 ****
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGstore_result: line 68: allocating memory for 1 tuples
  [NO_PID]: sqlca: code: 0, state: 00000
! [NO_PID]: ECPGget_data line 68: RESULT: 07-14-1987 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGget_data line 68: RESULT: 3 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
--- 44,50 ----
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGstore_result: line 68: allocating memory for 1 tuples
  [NO_PID]: sqlca: code: 0, state: 00000
! [NO_PID]: ECPGget_data line 68: RESULT: 14-07-1987 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGget_data line 68: RESULT: 3 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
***************
*** 60,66 ****
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGstore_result: line 68: allocating memory for 1 tuples
  [NO_PID]: sqlca: code: 0, state: 00000
! [NO_PID]: ECPGget_data line 68: RESULT: 07-14-1987 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGget_data line 68: RESULT: 3 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
--- 60,66 ----
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGstore_result: line 68: allocating memory for 1 tuples
  [NO_PID]: sqlca: code: 0, state: 00000
! [NO_PID]: ECPGget_data line 68: RESULT: 14-07-1987 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000
  [NO_PID]: ECPGget_data line 68: RESULT: 3 offset: -1 array: Yes
  [NO_PID]: sqlca: code: 0, state: 00000

*** expected/preproc-variable.stdout Fri Sep 8 10:03:40 2006
--- results/preproc-variable.stdout Tue Sep 26 09:51:00 2006
***************
*** 1,5 ****
! Mum , married 07-14-1987, children = 3
! Dad , born 19610721, married 07-14-1987, children = 3
  Child 1 , age = 16
  Child 2 , age = 14
  Child 3 , age = 9
--- 1,5 ----
! Mum , married 14-07-1987, children = 3
! Dad , born 19610721, married 14-07-1987, children = 3
  Child 1 , age = 16
  Child 2 , age = 14
  Child 3 , age = 9

Regards, Dave
On Mon, Sep 25, 2006 at 02:23:39PM +0100, Dave Page wrote: > testing connect/test1.pgc ... FAILED (log) > testing compat_informix/dec_test.pgc ... FAILED (output) > testing preproc/variable.pgc ... FAILED (log, output) > testing pgtypeslib/dt_test.pgc ... FAILED (log, output) > testing pgtypeslib/num_test.pgc ... FAILED (output) > testing pgtypeslib/num_test2.pgc ... FAILED (output) All should be fine now. I tested successfully with both cygwin and MinGW. Joachim -- Joachim Wieland joe@mcknight.de GPG key available
On Tue, Sep 26, 2006 at 09:57:16AM +0100, Dave Page wrote: > OK, I now see just one, date format related failure: > ... Did you run it with Joachim's patch or with up-to-date CVS checkout? It seems to me that you do not have the latest changes to CVS. We added a "set datestyle" to variable.pgc that should fix this failure. Michael -- Michael Meskes Email: Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: meskes@jabber.org Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!
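The remaining failure is the same date rendered in two field orders: the expected output has 07-14-1987 and the result has 14-07-1987. As a small illustration of why pinning the date style in the test removes the difference, here is a Python sketch. The `FORMATS` table and `render` function are illustrative names only; this is not ecpg or server code, and the mapping of labels to format strings is an assumption made for the example.

```python
from datetime import date

# Illustrative format strings: month-first versus day-first rendering.
FORMATS = {
    "MDY": "%m-%d-%Y",
    "DMY": "%d-%m-%Y",
}

def render(d, order):
    """Render a date using the given field order label."""
    return d.strftime(FORMATS[order])

married = date(1987, 7, 14)

if __name__ == "__main__":
    # One date, two renderings; a test that compares text output
    # passes only if both sides agree on the order.
    for order in ("MDY", "DMY"):
        print(order, render(married, order))
```

A regression test that compares textual output therefore depends on whichever order the environment happens to select, unless the test sets it explicitly, which is what the "set datestyle" change mentioned above does.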
> -----Original Message----- > From: Michael Meskes [mailto:meskes@postgresql.org] > Sent: 26 September 2006 10:39 > To: Dave Page > Cc: Joachim Wieland; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Buildfarm alarms > > On Tue, Sep 26, 2006 at 09:57:16AM +0100, Dave Page wrote: > > OK, I now see just one, date format related failure: > > ... > > Did you run it with Joachim's patch or with up-to-date CVS > checkout? It > seems to me that you do not have the latest changes to CVS. We added a > "set datestyle" to variable.pgc that should fix this failure. No, I used Joachim's patch as anoncvs hadn't caught up. I'll run it again - thanks. Regards Dave
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Dave Page > Sent: 26 September 2006 10:41 > To: Michael Meskes > Cc: Joachim Wieland; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Buildfarm alarms > > > > > -----Original Message----- > > From: Michael Meskes [mailto:meskes@postgresql.org] > > Sent: 26 September 2006 10:39 > > To: Dave Page > > Cc: Joachim Wieland; pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] Buildfarm alarms > > > > On Tue, Sep 26, 2006 at 09:57:16AM +0100, Dave Page wrote: > > > OK, I now see just one, date format related failure: > > > ... > > > > Did you run it with Joachim's patch or with up-to-date CVS > > checkout? It > > seems to me that you do not have the latest changes to CVS. > We added a > > "set datestyle" to variable.pgc that should fix this failure. > > No, I used Joachim's patch as anoncvs hadn't caught up. I'll run it > again - thanks. Yep - passes all tests now :-) Thanks, Dave.
I wrote: > Tom Lane wrote: > >> "Andrew Dunstan" <andrew@dunslane.net> writes: >> >>> It could certainly be done. In general, I have taken the view >>> that owners have the responsibility for monitoring their own machines. >>> >> Sure, but providing them tools to do that seems within buildfarm's >> purview. >> >> For some types of failure, the buildfarm script could make a local >> notification without bothering the server --- but a timeout on the >> server side would cover a wider variety of failures, including "this >> machine is dead and ought to be removed from the farm". >> >> > > Nothing gets removed. If a machine does not report on a branch for 30 days > it drops off the dashboard, but apart from that it is a retained historic > artifact. This buildup in history has been gradually slowing down the > dashboard, in fact, but Ian Barwick tells me that he has rewritten my > lousy SQL to make it fast again, so we'll soon get that working better. > > Anyway, I think we can do something fairly simple for these alarms. We'll > just have a special stanza in the config file, and a cron job that checks, > say, once a day, to see if we have exceeded the alarm period on any > machine/branch combination. > >

OK, I have a gadget to do this in place. It looks at the config of the last build registered on each branch for a stanza called 'alerts' that would look like this:

alerts => {
    HEAD => { alert_after => 24, alert_every => 48 },
    REL8_1_STABLE => { alert_after => 168, alert_every => 48 },
}

The settings are in hours, so this says that if we haven't seen a HEAD build in 1 day or a stable branch build in 1 week, alert the owner by email, and keep repeating the alert in each case every 2 days.

If some intrepid buildfarm owner wants to test this out by using low settings that would trigger an alert, that would be good - the cron job runs every hour.

cheers andrew
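The rule described above (alert once a branch has been silent for alert_after hours, then repeat every alert_every hours) can be sketched as follows. This is a Python illustration under stated assumptions, not the server's actual cron job: the function names, the dictionaries of last-build and last-alert times, and the decision to treat an alert older than the latest build as belonging to a previous silence are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Mirror of the 'alerts' stanza from the message above; values are hours.
ALERTS = {
    "HEAD": {"alert_after": 24, "alert_every": 48},
    "REL8_1_STABLE": {"alert_after": 168, "alert_every": 48},
}

def alert_due(now, last_build, last_alert, alert_after, alert_every):
    """Decide whether an alert should be sent on this cron pass."""
    if now - last_build < timedelta(hours=alert_after):
        return False  # a build was seen recently enough
    if last_alert is None or last_alert < last_build:
        return True   # first alert for the current silence (assumption)
    # Already alerted for this silence: repeat only after alert_every hours.
    return now - last_alert >= timedelta(hours=alert_every)

def branches_to_alert(now, last_builds, last_alerts, alerts=ALERTS):
    """Return the configured branches that are due an alert, in config order."""
    due = []
    for branch, cfg in alerts.items():
        last_build = last_builds.get(branch)
        if last_build is None:
            continue  # no build on record for this branch; nothing to compare
        if alert_due(now, last_build, last_alerts.get(branch),
                     cfg["alert_after"], cfg["alert_every"]):
            due.append(branch)
    return due

if __name__ == "__main__":
    now = datetime(2006, 9, 27, 12, 0)
    builds = {"HEAD": now - timedelta(hours=30),
              "REL8_1_STABLE": now - timedelta(hours=100)}
    print(branches_to_alert(now, builds, {}))
```

With an hourly cron pass, the repeat interval is what keeps a long silence from producing one email per hour.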
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of > Andrew Dunstan > Sent: 27 September 2006 14:56 > To: pgbuildfarm-members@pgfoundry.org > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Buildfarm alarms > > If some intrepid buildfarm owner wants to test this out by using low > settings that would trigger an alert that would be good - the > cron job > runs every hour.

Dunno about intrepid, but I've added the following to Snake:

alerts => {
    HEAD => { alert_after => 1, alert_every => 2 },
    REL8_1_STABLE => { alert_after => 168, alert_every => 48 },
    REL8_0_STABLE => { alert_after => 168, alert_every => 48 },
}

Thanks for your work on this. Regards, Dave
On Wed, 27 Sep 2006, Andrew Dunstan wrote: > The settings are in hours, so this says that if we haven't seen a HEAD build > in 1 day or a stable branch build in 1 week, alert the owner by email, and > keep repeating the alert in each case every 2 days. > How does this know if there wasn't a build because nothing in CVS changed over that time period? Especially on the back branches it is normal to go weeks without a build. Kris Jurka
Kris Jurka wrote: > > > On Wed, 27 Sep 2006, Andrew Dunstan wrote: > >> The settings are in hours, so this says that if we haven't seen a >> HEAD build in 1 day or a stable branch build in 1 week, alert the >> owner by email, and keep repeating the alert in each case every 2 days. >> > > How does this know if there wasn't a build because nothing in CVS > changed over that time period? Especially on the back branches it is > normal to go weeks without a build. > > Kris Jurka > Indeed. The short answer is it doesn't. But there is a buildfarm config option to allow you to force a build every so often even if there hasn't been a CVS change, and I'm thinking of providing an option for this to be branch specific. Then you would make this setting shorter than your alarm period for any branch you had an alarm set for. cheers andrew
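The constraint in the last sentence above (the forced-build interval must be shorter than the alarm period on any branch that has an alarm) can be checked mechanically. The sketch below is in Python and is only an illustration: the per-branch option did not exist at the time of this message, so the `force_every_hours` mapping is a hypothetical name, and `branches_at_risk` is not part of any buildfarm tool.

```python
def branches_at_risk(alerts, force_every_hours):
    """Return branches whose alarm could fire merely because CVS was idle.

    alerts: branch -> {"alert_after": hours, ...}, as in the 'alerts' stanza.
    force_every_hours: hypothetical branch -> hours between forced builds.
    A branch is at risk if it has no forced build at all, or if the forced
    interval is not shorter than its alert_after period.
    """
    risky = []
    for branch, cfg in alerts.items():
        forced = force_every_hours.get(branch)
        if forced is None or forced >= cfg["alert_after"]:
            risky.append(branch)
    return sorted(risky)

if __name__ == "__main__":
    alerts = {
        "HEAD": {"alert_after": 24, "alert_every": 48},
        "REL8_1_STABLE": {"alert_after": 168, "alert_every": 48},
    }
    # HEAD is forced twice a day; the stable branch has no forced build.
    print(branches_at_risk(alerts, {"HEAD": 12}))
```

Running a check like this when the config is edited would catch the case Kris describes, where a quiet back branch looks identical to a dead machine.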
On Wed, Sep 27, 2006 at 01:55:21PM -0400, Andrew Dunstan wrote: > Kris Jurka wrote: > > > > > > On Wed, 27 Sep 2006, Andrew Dunstan wrote: > > > >> The settings are in hours, so this says that if we haven't seen a > >> HEAD build in 1 day or a stable branch build in 1 week, alert the > >> owner by email, and keep repeating the alert in each case every 2 days. > >> > > > > How does this know if there wasn't a build because nothing in CVS > > changed over that time period? Especially on the back branches it is > > normal to go weeks without a build. > > > > Kris Jurka > > > > Indeed. The short answer is it doesn't. But there is a buildfarm config > option to allow you to force a build every so often even if there hasn't > been a CVS change, and I'm thinking of providing an option for this to > be branch specific. Then you would make this setting shorter than your > alarm period for any branch you had an alarm set for. Another possibility is just having the client report "no CVS changes detected" to the server, as a form of a ping. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > > Another possibility is just having the client report "no CVS changes > detected" to the server, as a form of a ping. > I am not going to re-architect the buildfarm client and server for this. I think what I have done will be quite sufficient. I suspect most people will only want alarms on HEAD anyway. cheers andrew