Thread: Recent buildfarm failures involving statement_timeout

Recent buildfarm failures involving statement_timeout

From: Tom Lane
The buildfarm has shown several intermittent failures recently:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-20%2000:06:01
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=cobra&dt=2008-04-26%2004:15:02
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-27%2000:06:01

all with the same signature:

*** ./expected/prepared_xacts.out    Sun Apr 27 02:06:33 2008
--- ./results/prepared_xacts.out    Sun Apr 27 02:10:19 2008
***************
*** 161,166 ****
--- 161,167 ----
  SELECT * FROM pxtest3;
  ERROR:  canceling statement due to statement timeout
  reset statement_timeout;
+ ERROR:  canceling statement due to statement timeout
  -- Disconnect, we will continue testing in a different backend
  \c -
  -- There should still be two prepared transactions
 

which so far as I recall is not a failure we've seen in the past
(though we still don't have buildfarm infrastructure that makes it
easy to confirm or refute that opinion).

One explanation for this is that these buildfarm machines were so
heavily overloaded that it really did take more than 1 sec to execute
"reset statement_timeout".  That's not very plausible, though,
especially not considering that both machines logged session times
(via log_disconnections) that don't seem very far out of line.
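
For reference, the relevant portion of the test goes roughly like this (a
paraphrase; the exact statements and comments in prepared_xacts.sql may
differ slightly):

    -- pxtest3 is held locked by a prepared transaction created earlier in
    -- the test, so this SELECT is expected to block and then be cancelled:
    set statement_timeout to 1000;   -- 1 second, in milliseconds
    SELECT * FROM pxtest3;           -- expected ERROR: canceling statement ...
    reset statement_timeout;         -- still executes under the 1-second
                                     -- timeout, so a sufficiently starved
                                     -- backend can see it cancelled as well,
                                     -- which is the extra ERROR in the diff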

Another theory is that we broke something recently, but I see no obvious
candidates in the CVS logs --- and I've spent the evening running 488
cycles of the parallel regression tests here, with no error.

Anyone have a theory?
        regards, tom lane


Re: Recent buildfarm failures involving statement_timeout

From: Gregory Stark
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Another theory is that we broke something recently, but I see no obvious
> candidates in the CVS logs --- and I've spent the evening running 488
> cycles of the parallel regression tests here, with no error.

It looks to me like a psql bug rather than a server bug. psql has some hairy
logic to try to remember whether it has handled an error return yet or not,
and I had a devil of a time trying to keep it working with the concurrent
psql patch. One of the failure modes I saw a lot was handling an error twice,
like this.

It's possible it's a race condition where it only happens if psql gets the
error at a certain point, such as while fetching the results as opposed to
while the execute is pending. However, IIRC the logic without concurrent psql
is actually fairly straightforward, and there are no windows like this. Unless
it's actually in libpq and not psql. Hm.

Perhaps Bruce's terminate patch wasn't completely reverted and there was a
flag somewhere which isn't being reset properly?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: Recent buildfarm failures involving statement_timeout

From: Andrew Dunstan

Tom Lane wrote:
> The buildfarm has shown several intermittent failures recently:
>
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-20%2000:06:01
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=cobra&dt=2008-04-26%2004:15:02
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=heron&dt=2008-04-27%2000:06:01
>
> all with the same signature:
>
> *** ./expected/prepared_xacts.out    Sun Apr 27 02:06:33 2008
> --- ./results/prepared_xacts.out    Sun Apr 27 02:10:19 2008
> ***************
> *** 161,166 ****
> --- 161,167 ----
>   SELECT * FROM pxtest3;
>   ERROR:  canceling statement due to statement timeout
>   reset statement_timeout;
> + ERROR:  canceling statement due to statement timeout
>   -- Disconnect, we will continue testing in a different backend
>   \c -
>   -- There should still be two prepared transactions
>
> which so far as I recall is not a failure we've seen in the past
> (though we still don't have buildfarm infrastructure that makes it
> easy to confirm or refute that opinion).
>   

pgbfprod=# select sysname, snapshot from build_status_log where branch = 
'HEAD' and log_stage = 'check.log' and log_text ~ $$\+ ERROR:  canceling 
statement due to statement timeout$$;
 sysname |      snapshot      
---------+---------------------
 fennec  | 2008-03-06 01:10:02
 heron   | 2008-03-11 18:06:01
 heron   | 2008-03-11 23:06:01
 heron   | 2008-04-20 00:06:01
 cobra   | 2008-04-26 04:15:02
 heron   | 2008-04-27 00:06:01
(6 rows)


So, nothing before March.


cheers

andrew


Re: Recent buildfarm failures involving statement_timeout

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Another theory is that we broke something recently, but I see no obvious
>> candidates in the CVS logs --- and I've spent the evening running 488
>> cycles of the parallel regression tests here, with no error.

> It looks to me like a psql bug rather than a server bug.

Interesting theory, but it doesn't fit the facts: the extra error is in
the server log too.
        regards, tom lane


Re: Recent buildfarm failures involving statement_timeout

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> pgbfprod=# select sysname, snapshot from build_status_log where branch = 
> 'HEAD' and log_stage = 'check.log' and log_text ~ $$\+ ERROR:  canceling 
> statement due to statement timeout$$;
>  sysname |      snapshot      
> ---------+---------------------
>  fennec  | 2008-03-06 01:10:02
>  heron   | 2008-03-11 18:06:01
>  heron   | 2008-03-11 23:06:01
>  heron   | 2008-04-20 00:06:01
>  cobra   | 2008-04-26 04:15:02
>  heron   | 2008-04-27 00:06:01
> (6 rows)

Hmm, were there any matches in non-HEAD tests?

The log_disconnections records in these tests all seem to indicate that the
"reset" really did take a full second --- the total session time is
2-seconds-plus rather than 1-second-plus as in a successful run.

Since prepared_xacts runs as one member of a 20-way parallel test,
I guess it's not out of the question that the machine might sometimes
be loaded to the point where that backend doesn't get to run for a while.
This would be particularly true if the kernel scheduler is one that
penalizes processes that it's decided are "background" processes ---
which might explain why certain platforms seem to show this more than
others.

Also, I trolled the CVS logs from 2008-03-06 back to the 8.3 branch point
and couldn't find anything that seemed likely to correspond to a large
performance dropoff.

So I'm kind of inclined to write this off as a testing glitch and "fix"
it by bumping the timeout up to 2 seconds.  But it would be more
comforting if we'd seen it in back branches, or could spot some reason
why the test would be more likely to fail this way than in the past.
The prepared_xacts test itself hasn't changed materially since it was
put in...
        regards, tom lane


Re: Recent buildfarm failures involving statement_timeout

From: Andrew Dunstan

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> pgbfprod=# select sysname, snapshot from build_status_log where branch = 
>> 'HEAD' and log_stage = 'check.log' and log_text ~ $$\+ ERROR:  canceling 
>> statement due to statement timeout$$;
>>  sysname |      snapshot      
>> ---------+---------------------
>>  fennec  | 2008-03-06 01:10:02
>>  heron   | 2008-03-11 18:06:01
>>  heron   | 2008-03-11 23:06:01
>>  heron   | 2008-04-20 00:06:01
>>  cobra   | 2008-04-26 04:15:02
>>  heron   | 2008-04-27 00:06:01
>> (6 rows)
>>     
>
> Hmm, were there any matches in non-HEAD tests?
>   
 sysname  |      snapshot       |    branch    
----------+---------------------+---------------
 lionfish | 2007-04-19 09:30:27 | REL8_2_STABLE
 lionfish | 2007-05-29 23:30:07 | REL8_1_STABLE
 lionfish | 2007-09-22 23:30:07 | REL8_1_STABLE



cheers

andrew


Re: Recent buildfarm failures involving statement_timeout

From: Stefan Kaltenbrunner
Andrew Dunstan wrote:
> 
> 
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>  
>>> pgbfprod=# select sysname, snapshot from build_status_log where 
>>> branch = 'HEAD' and log_stage = 'check.log' and log_text ~ $$\+ 
>>> ERROR:  canceling statement due to statement timeout$$;
>>>  sysname |      snapshot      
>>> ---------+---------------------
>>>  fennec  | 2008-03-06 01:10:02
>>>  heron   | 2008-03-11 18:06:01
>>>  heron   | 2008-03-11 23:06:01
>>>  heron   | 2008-04-20 00:06:01
>>>  cobra   | 2008-04-26 04:15:02
>>>  heron   | 2008-04-27 00:06:01
>>> (6 rows)
>>>     
>>
>> Hmm, were there any matches in non-HEAD tests?
>>   
> 
> sysname  |      snapshot       |    branch   
> ---------+---------------------+---------------
> lionfish | 2007-04-19 09:30:27 | REL8_2_STABLE
> lionfish | 2007-05-29 23:30:07 | REL8_1_STABLE
> lionfish | 2007-09-22 23:30:07 | REL8_1_STABLE

lionfish is (or rather was - the power supply got fried and I have no 
replacement yet) one of the slowest machines (250MHz mipsel, 48MB RAM) on 
the buildfarm (it took about 5-6h for a complete run).


Stefan


Re: Recent buildfarm failures involving statement_timeout

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> sysname  |      snapshot       |    branch   
>> ---------+---------------------+---------------
>> lionfish | 2007-04-19 09:30:27 | REL8_2_STABLE
>> lionfish | 2007-05-29 23:30:07 | REL8_1_STABLE
>> lionfish | 2007-09-22 23:30:07 | REL8_1_STABLE

> lionfish is (or rather was - the power supply got fried and I have no 
> replacement yet) one of the slowest machines (250MHz mipsel, 48MB RAM) on 
> the buildfarm (it took about 5-6h for a complete run).

Unless someone has a better idea, I suggest we just raise the timeouts
used in prepared_xacts from 1 sec to 2 sec.  That should reduce the
recurrence rate to something vanishingly small if the problem is just
"it's too slow".  If it's that something's actually getting stuck, then
we'll continue to see failures and will know that closer investigation
is needed.
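
Concretely, the change would amount to something like this in
prepared_xacts.sql (a sketch of the intent only, not an actual patch):

    -- was: set statement_timeout to 1000;
    set statement_timeout to 2000;   -- 2 seconds: enough slack for a heavily
                                     -- loaded machine to finish both the
                                     -- expected cancellation of the SELECT
                                     -- and the subsequent RESET in time
    SELECT * FROM pxtest3;
    reset statement_timeout;
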
        regards, tom lane