Thread: Recent buildfarm failures involving statement_timeout
The buildfarm has shown several intermittent failures recently:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-20%2000:06:01
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=cobra&dt=2008-04-26%2004:15:02
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-27%2000:06:01

all with the same signature:

*** ./expected/prepared_xacts.out	Sun Apr 27 02:06:33 2008
--- ./results/prepared_xacts.out	Sun Apr 27 02:10:19 2008
***************
*** 161,166 ****
--- 161,167 ----
  SELECT * FROM pxtest3;
  ERROR: canceling statement due to statement timeout
  reset statement_timeout;
+ ERROR: canceling statement due to statement timeout
  -- Disconnect, we will continue testing in a different backend
  \c -
  -- There should still be two prepared transactions

which so far as I recall is not a failure we've seen in the past (though
we still don't have buildfarm infrastructure that makes it easy to
confirm or refute that opinion).

One explanation is that these buildfarm machines were so heavily
overloaded that it really did take more than 1 sec to execute "reset
statement_timeout".  That's not very plausible, though, especially
considering that both machines logged session times (via
log_disconnections) that don't seem very far out of line.

Another theory is that we broke something recently, but I see no obvious
candidates in the CVS logs --- and I've spent the evening running 488
cycles of the parallel regression tests here, with no error.

Anyone have a theory?

			regards, tom lane
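[Editor's note: for readers without the regression source at hand, the failing section boils down to something like the following. This is a sketch reconstructed from the diff above, not the verbatim prepared_xacts.sql; the SELECT presumably blocks on a lock held by one of the prepared transactions until the timeout cancels it.]

```sql
-- Sketch of the failing sequence (reconstructed from the diff, not the
-- verbatim test source).  The SELECT is expected to block -- presumably
-- on a lock held by a prepared transaction -- and then be canceled:
SET statement_timeout TO 1000;   -- 1 second
SELECT * FROM pxtest3;           -- expected: ERROR: canceling statement
                                 --   due to statement timeout
RESET statement_timeout;         -- expected to succeed; in the failing
                                 -- runs this command itself times out
```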
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Another theory is that we broke something recently, but I see no obvious > candidates in the CVS logs --- and I've spent the evening running 488 > cycles of the parallel regression tests here, with no error. It looks to me like a psql bug rather than a server bug. Psql has some hairy logic to try to remember whether it has handled an error return yet or not and I had a devil of a time trying to keep it working with the concurrent psql patch. One of the failure modes I saw a lot was handling an error twice like this. It's possible it's a race condition where it only happens if psql gets the error at a certain point, such as while fetching the results as opposed to while the execute is pending. However IIRC the logic without concurrent psql is actually fairly straightforward and there are no windows like this. Unless it's actually in libpq and not psql. Hm. Perhaps Bruce's terminate patch wasn't completely reverted and there was a flag somewhere which isn't being reset properly? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Tom Lane wrote:
> The buildfarm has shown several intermittent failures recently:
>
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-20%2000:06:01
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=cobra&dt=2008-04-26%2004:15:02
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-27%2000:06:01
>
> all with the same signature:
>
> *** ./expected/prepared_xacts.out	Sun Apr 27 02:06:33 2008
> --- ./results/prepared_xacts.out	Sun Apr 27 02:10:19 2008
> ***************
> *** 161,166 ****
> --- 161,167 ----
>   SELECT * FROM pxtest3;
>   ERROR: canceling statement due to statement timeout
>   reset statement_timeout;
> + ERROR: canceling statement due to statement timeout
>   -- Disconnect, we will continue testing in a different backend
>   \c -
>   -- There should still be two prepared transactions
>
> which so far as I recall is not a failure we've seen in the past
> (though we still don't have buildfarm infrastructure that makes it
> easy to confirm or refute that opinion).

pgbfprod=# select sysname, snapshot from build_status_log
           where branch = 'HEAD' and log_stage = 'check.log'
           and log_text ~ $$\+ ERROR: canceling statement due to statement timeout$$;

 sysname |      snapshot
---------+---------------------
 fennec  | 2008-03-06 01:10:02
 heron   | 2008-03-11 18:06:01
 heron   | 2008-03-11 23:06:01
 heron   | 2008-04-20 00:06:01
 cobra   | 2008-04-26 04:15:02
 heron   | 2008-04-27 00:06:01
(6 rows)

So, nothing before March.

cheers

andrew
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Another theory is that we broke something recently, but I see no obvious
>> candidates in the CVS logs --- and I've spent the evening running 488
>> cycles of the parallel regression tests here, with no error.

> It looks to me like a psql bug rather than a server bug.

Interesting theory, but it doesn't fit the facts: the extra error is in
the server log too.

			regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes:
> pgbfprod=# select sysname, snapshot from build_status_log where branch =
> 'HEAD' and log_stage = 'check.log' and log_text ~ $$\+ ERROR: canceling
> statement due to statement timeout$$;
>  sysname |      snapshot
> ---------+---------------------
>  fennec  | 2008-03-06 01:10:02
>  heron   | 2008-03-11 18:06:01
>  heron   | 2008-03-11 23:06:01
>  heron   | 2008-04-20 00:06:01
>  cobra   | 2008-04-26 04:15:02
>  heron   | 2008-04-27 00:06:01
> (6 rows)

Hmm, were there any matches in non-HEAD tests?

The log_disconnections records in these tests all seem to indicate that
the "reset" really did take a full second --- the total session time is
2-seconds-plus rather than 1-second-plus as in a successful run.  Since
prepared_xacts runs as one member of a 20-way parallel test, I guess
it's not out of the question that the machine might sometimes be loaded
to the point where that backend doesn't get to run for a while.  This
would be particularly true if the kernel scheduler is one that penalizes
processes it has decided are "background" processes --- which might
explain why certain platforms seem to show this more than others.

Also, I trawled the CVS logs from 2008-03-06 back to the 8.3 branch
point and couldn't find anything that seemed likely to correspond to a
large performance dropoff.

So I'm inclined to write this off as a testing glitch and "fix" it by
bumping the timeout up to 2 seconds.  But it would be more comforting if
we'd seen it in back branches, or could spot some reason why the test
would be more likely to fail this way now than in the past.  The
prepared_xacts test itself hasn't changed materially since it was put
in...

			regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> pgbfprod=# select sysname, snapshot from build_status_log where branch =
>> 'HEAD' and log_stage = 'check.log' and log_text ~ $$\+ ERROR: canceling
>> statement due to statement timeout$$;
>>  sysname |      snapshot
>> ---------+---------------------
>>  fennec  | 2008-03-06 01:10:02
>>  heron   | 2008-03-11 18:06:01
>>  heron   | 2008-03-11 23:06:01
>>  heron   | 2008-04-20 00:06:01
>>  cobra   | 2008-04-26 04:15:02
>>  heron   | 2008-04-27 00:06:01
>> (6 rows)
>
> Hmm, were there any matches in non-HEAD tests?

 sysname  |      snapshot       |    branch
----------+---------------------+---------------
 lionfish | 2007-04-19 09:30:27 | REL8_2_STABLE
 lionfish | 2007-05-29 23:30:07 | REL8_1_STABLE
 lionfish | 2007-09-22 23:30:07 | REL8_1_STABLE

cheers

andrew
Andrew Dunstan wrote:
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> pgbfprod=# select sysname, snapshot from build_status_log where
>>> branch = 'HEAD' and log_stage = 'check.log' and log_text ~ $$\+
>>> ERROR: canceling statement due to statement timeout$$;
>>>  sysname |      snapshot
>>> ---------+---------------------
>>>  fennec  | 2008-03-06 01:10:02
>>>  heron   | 2008-03-11 18:06:01
>>>  heron   | 2008-03-11 23:06:01
>>>  heron   | 2008-04-20 00:06:01
>>>  cobra   | 2008-04-26 04:15:02
>>>  heron   | 2008-04-27 00:06:01
>>> (6 rows)
>>
>> Hmm, were there any matches in non-HEAD tests?
>
>  sysname  |      snapshot       |    branch
> ----------+---------------------+---------------
>  lionfish | 2007-04-19 09:30:27 | REL8_2_STABLE
>  lionfish | 2007-05-29 23:30:07 | REL8_1_STABLE
>  lionfish | 2007-09-22 23:30:07 | REL8_1_STABLE

lionfish is (or rather was -- its power supply got fried and I have no
replacement yet) one of the slowest machines on the buildfarm (250MHz
mipsel, 48MB RAM); a complete run took about 5-6 hours.

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>  sysname  |      snapshot       |    branch
>> ----------+---------------------+---------------
>>  lionfish | 2007-04-19 09:30:27 | REL8_2_STABLE
>>  lionfish | 2007-05-29 23:30:07 | REL8_1_STABLE
>>  lionfish | 2007-09-22 23:30:07 | REL8_1_STABLE

> lionfish is (or rather was -- powersupply got fried and I have no
> replacement yet) one of the slowest machines (250MHz mipsel, 48MB RAM)
> on the buildfarm (took about 5-6h for a complete run).

Unless someone has a better idea, I suggest we just raise the timeouts
used in prepared_xacts from 1 sec to 2 sec.  That should reduce the
recurrence rate to something vanishingly small if the problem is just
"it's too slow".  If something is actually getting stuck, then we'll
continue to see failures and will know that closer investigation is
needed.

			regards, tom lane
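[Editor's note: concretely, the proposed band-aid amounts to a one-line tweak along these lines. This is a hypothetical sketch; the exact spelling and placement of the SET in the real prepared_xacts.sql may differ.]

```sql
-- Hypothetical one-line change in prepared_xacts.sql:
SET statement_timeout TO 2000;   -- was 1000 (1 sec); 2 sec gives very
                                 -- slow or overloaded machines headroom
```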