Hi,
On 2022-01-12 14:34:00 -0500, Andrew Dunstan wrote:
> For some considerable time the recovery tests have been at best flaky on
> Windows, and at worst disastrous (i.e. they can hang rather than just
> fail). It's a problem I worked around on my buildfarm animals by
> disabling the tests, hoping to find time to get back to analysing the
> problem. But now we are seeing failures on the cfbot too (e.g.
> https://cirrus-ci.com/task/5860692694663168 and
> https://cirrus-ci.com/task/5316745152954368 ) so I think we need to
> spend some effort on finding out what's going on here.
I'm somewhat certain that this is caused by assertions or aborts hanging with
a GUI popup, e.g. due to a check in the CRT.
I saw this kind of hang a lot in the aio development tree before I merged
the changes to error/abort handling on Windows. Before the recent CI
changes, cfbot ran the Windows tests without assertions, which - besides
just running fewer tests - explains why we saw fewer such hangs before:
there are more sources of such error popups in the debug CRT.
It'd be nice if somebody could look at the patch and discussion in
https://www.postgresql.org/message-id/20211005193033.tg4pqswgvu3hcolm%40alap3.anarazel.de
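To make the mechanism concrete for anyone not following that thread: below is a
rough sketch (not the actual patch, just an illustration; the helper name is
made up) of the kind of Win32 / debug-CRT calls involved in making assertion
failures and abort() terminate the process instead of blocking on a dialog
box. Assumes MSVC; the _Crt* calls only do something with the debug CRT
(/MDd or /MTd).

/*
 * Illustrative sketch: route CRT assertions/errors to stderr and disable
 * the interactive abort/WER dialogs, so a failing process dies instead of
 * waiting forever on a popup.
 */
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#include <windows.h>
#include <crtdbg.h>

static void
suppress_error_popups(void)
{
	/* Don't show critical-error / GP-fault dialog boxes. */
	SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX |
				 SEM_NOOPENFILEERRORBOX);

	/* Make abort() skip the message box and Windows Error Reporting. */
	_set_abort_behavior(0, _WRITE_ABORT_MSG | _CALL_REPORTFAULT);

	/* Send debug-CRT asserts/errors/warnings to stderr, not a popup. */
	_CrtSetReportMode(_CRT_ASSERT, _CRTDBG_MODE_FILE);
	_CrtSetReportFile(_CRT_ASSERT, _CRTDBG_FILE_STDERR);
	_CrtSetReportMode(_CRT_ERROR, _CRTDBG_MODE_FILE);
	_CrtSetReportFile(_CRT_ERROR, _CRTDBG_FILE_STDERR);
	_CrtSetReportMode(_CRT_WARN, _CRTDBG_MODE_FILE);
	_CrtSetReportFile(_CRT_WARN, _CRTDBG_FILE_STDERR);
}

int
main(void)
{
	suppress_error_popups();

	/* With a debug CRT (and NDEBUG not set) this would otherwise pop up a dialog. */
	assert(1 + 1 == 3);
	return 0;
}

If something along those lines is missing, a failed assertion just sits behind
a message box until the CI-level timeout kills the whole task, which would
match the symptoms above.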
The debugging information for the cirrus-ci tasks has a list of
processes. E.g. for https://cirrus-ci.com/task/5860692694663168 the
following processes were still running:
1 agent.exe
1 CExecSvc.exe
1 csrss.exe
1 fontdrvhost.exe
1 lsass.exe
1 msdtc.exe
1 psql.exe
1 services.exe
1 wininit.exe
9 cmd.exe
9 perl.exe
9 svchost.exe
49 postgres.exe
So we know that some tests were actually still in progress... It's
particularly interesting that there's a psql process still hanging around...
Before I "simplified" that away, the CI patch ran all tests with a shorter
individual timeout than the overall CI timeout, so we'd see error logs
etc. Perhaps that was a mistake to remove. IIRC I did something like
"C:\Program Files\Git\usr\bin\timeout.exe" -v --kill-after=35m 30m perl path/to/vcregress.pl ...
Perhaps worth re-adding?
Greetings,
Andres Freund