On Sun, Oct 08, 2023 at 05:48:55PM -0400, Tom Lane wrote:
> There have been intermittent failures on various buildfarm machines
> since this went in. After seeing one on my own animal mamba [1],
> I tried to reproduce it manually on that machine, and it does
> indeed fail about one time in two. The buildfarm script is not
> managing to capture the relevant log files, but what I see in a
> manual run is that 001_worker_spi.pl logs this:
Thanks for the logs. I had noticed the failure but could not make any
sense of it, as the buildfarm did not provide enough information.
Serinus has complained once, for instance.
> Since this only seems to happen on slow machines, I'd call it a timing
> problem or race condition. Unless you want to argue that the race
> should not happen, probably the fix is to make the test script cope
> with this worker_spi_launch() call failing. As long as we see the
> expected result from wait_for_log, we can be pretty sure the right
> thing happened.
The trick to reproduce the failure is to slow down worker_spi_launch()
before WaitForBackgroundWorkerStartup(), with a worker already
registered, so that the worker has time to start and exit because of
the ALLOW_CONNECTIONS restriction. (SendPostmasterSignal() in
RegisterDynamicBackgroundWorker() interrupts a hardcoded sleep, so
I've just used an on-disk flag.)
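For anyone who wants to reproduce this locally, the on-disk flag stall
can be sketched roughly as below. This is only an illustration and not
the code I used: wait_while_flag_exists(), the 10ms poll interval and
the max_polls cap are all hypothetical names and values. It is written
as standalone POSIX C so that it can be tried outside the backend; the
idea is to call something like it in worker_spi_launch() before
WaitForBackgroundWorkerStartup(), create the flag file before the
launch, and remove it once the worker has had time to start and exit.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical debugging helper, not part of PostgreSQL: stall the caller
 * for as long as the file at flag_path exists, polling every 10ms, and
 * give up after max_polls polls.
 *
 * Returns true if the flag is gone, false if we timed out with the flag
 * still present.
 */
static bool
wait_while_flag_exists(const char *flag_path, int max_polls)
{
	struct stat st;
	struct timespec delay = {0, 10 * 1000 * 1000};	/* 10ms */

	for (int i = 0; i < max_polls; i++)
	{
		/* Any stat() failure, typically ENOENT, ends the stall. */
		if (stat(flag_path, &st) != 0)
			return true;

		/*
		 * A signal may cut this sleep short.  That is harmless here
		 * because the loop re-checks the flag instead of trusting the
		 * sleep duration.
		 */
		nanosleep(&delay, NULL);
	}

	return false;
}
```

Polling a file rather than sleeping for a fixed time is the point of the
sketch: an early wakeup only causes one extra stat() call, so the stall
lasts until the flag is removed or the poll cap is reached.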
Another thing is that we cannot rely on the PID returned by launch(),
as the call could fail, so $worker3_pid needs to disappear. If we do
that, I'd rather just switch to a database dedicated to the ALLOWCONN
tests instead of reusing "mydb", which could have other workers. The
attached fixes the issue for me.
--
Michael