Maybe BF "timedout" failures are the client script's fault? - Mailing list pgsql-hackers

From Tom Lane
Subject Maybe BF "timedout" failures are the client script's fault?
Msg-id 2423164.1767991263@sss.pgh.pa.us
List pgsql-hackers
We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that, though, after noticing that there are
a small number of such failures on other animals, e.g. [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal.  So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.
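
To illustrate the failure mode, here is a minimal standalone sketch. It is
not taken from the buildfarm client, and it assumes a POSIX-like system
where fork() and SIGUSR1 are available. The helper names sleep_full and
poke_parent_later are made up for this example. A child process signals
the parent one second into a three-second wait: the bare sleep() comes
back early, while a loop that re-checks a deadline waits out the full time.

```perl
use strict;
use warnings;

# Hypothetical helper: sleep until a deadline, restarting sleep()
# whenever a signal cuts it short.
sub sleep_full
{
    my ($wait_time) = @_;
    my $end_time = time + $wait_time;
    while ((my $delay = $end_time - time) > 0)
    {
        sleep($delay);
    }
    return;
}

# Hypothetical helper: fork a child that sends SIGUSR1 to the parent
# after one second, simulating a stray signal.
sub poke_parent_later
{
    my $parent = $$;
    my $pid    = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0)
    {
        sleep(1);
        kill 'USR1', $parent;
        exit 0;
    }
    return $pid;
}

# Install a do-nothing handler; the default action for USR1 would
# terminate the process instead of merely interrupting sleep().
$SIG{USR1} = sub { };

# Case 1: a bare sleep() is interrupted by the signal.
my $start = time;
my $child = poke_parent_later();
sleep(3);
my $plain_elapsed = time - $start;
waitpid($child, 0);

# Case 2: the deadline loop absorbs the signal and keeps waiting.
$start = time;
$child = poke_parent_later();
sleep_full(3);
my $loop_elapsed = time - $start;
waitpid($child, 0);

print "plain sleep(3) returned after ${plain_elapsed}s\n";
print "looped sleep returned after ${loop_elapsed}s\n";
```

The plain call should report less than three seconds and the looped call
at least three. Since time has one-second resolution, the exact figures can
vary by a second from run to run.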

I propose the attached patch to the BF client to try to make this
more robust.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37

--- run_build.pl.orig    2025-11-25 07:47:25
+++ run_build.pl    2026-01-09 15:02:23
@@ -3415,7 +3415,13 @@ sub wait_timeout
         $SIG{$sig} = 'DEFAULT';
     }
     $SIG{'TERM'} = \&silent_terminate;
-    sleep($wait_time);
+    # loop to absorb any unexpected signals without dying early
+    my $end_time = time + $wait_time;
+    while (time < $end_time)
+    {
+        my $delay = $end_time - time;
+        sleep($delay);
+    }
     print STDERR "Run timed out, terminating.\n";
     my $sig = $usig[0] || 'TERM';
     kill $sig, $main_pid;
