Thread: Re: Robocopy might be not robust enough for never-ending testing on Windows
Re: Robocopy might be not robust enough for never-ending testing on Windows
From
Andrew Dunstan
Date:
On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote: > Hello hackers, > > While trying to reproduce inexplicable drongo failures (e. g., [1]) > seemingly related to BackgroundPsql, I stumbled upon close, but not > the same issue. After many (6-8 thousands) iterations of the > 015_stream.pl TAP test, psql failed to start with a > STATUS_DLL_INIT_FAILED > error, and a corresponding Windows popup preventing a test exit (see ss-1 > in the archive attached). > > Upon reaching that state of the system, following test runs fail with one > or another error related to memory (not only with psql, but also with the > server processes): > testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11 > 20:01:51.777 PDT [8812] LOG: server process (PID 11532) was > terminated by exception 0xC00000FD > testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11 > 20:01:19.806 PDT [2036] LOG: server process (PID 10548) was > terminated by exception 0xC00000FD > testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11 > 19:59:41.513 PDT [9128] LOG: server process (PID 14476) was > terminated by exception 0xC0000409 > testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11 > 20:03:27.801 PDT [10156] LOG: server process (PID 2236) was > terminated by exception 0xC0000409 > testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11 > 19:59:41.359 PDT [10656] LOG: server process (PID 14712) was > terminated by exception 0xC000012D > testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11 > 20:02:23.815 PDT [13704] LOG: server process (PID 13992) was > terminated by exception 0xC00000FD > testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11 > 19:59:41.360 PDT [9760] LOG: server process (PID 11608) was > terminated by exception 0xC0000142 > ... > > When tests fail, I see Commit Charge reaching 100% (see ss-2 in the > attachment), while Physical Memory isn't all used up. To get OS to a > working state, I had to reboot it — killing processes, logoff/logon > didn't > help. > > I reproduced this issue again, investigated it and found out that it is > caused by robocopy (used by PostgreSQL::Test::Cluster->init), which is > leaking kernel objects, namely "Event objects" within Non-Paged pool on > each run. > > This can be seen with Kernel Pool Monitor, or just with this simple PS > script: > for ($i = 1; $i -le 100; $i++) > { > echo "iteration $i" > rm -r c:\temp\target > robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target > Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % > PoolNonpagedBytes > } > > It shows to me: > iteration 1 > Total Copied Skipped Mismatch FAILED Extras > Dirs : 27 27 0 0 0 0 > Files : 968 968 0 0 0 0 > Bytes : 38.29 m 38.29 m 0 0 0 0 > Times : 0:00:00 0:00:00 0:00:00 0:00:00 > ... > 1226063872 > ... > iteration 100 > Total Copied Skipped Mismatch FAILED Extras > Dirs : 27 27 0 0 0 0 > Files : 968 968 0 0 0 0 > Bytes : 38.29 m 38.29 m 0 0 0 0 > Times : 0:00:00 0:00:00 0:00:00 0:00:00 > ... > 1245220864 > > (That is, 0.1-0.2 MB leaks per one robocopy run.) > > I observed this on Windows 10 (Version 10.0.19045.4780), with all updates > installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using > robocopy v14393 on Windows 10 doesn't affect the issue. > > Perhaps this information can be helpful for someone who is running > buildfarm/CI tests on Windows animals... > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53 > > Interesting, good detective work. Still, wouldn't this mean drongo would fail consistently? I wonder why we're using robocopy instead of our own RecursiveCopy module? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Robocopy might be not robust enough for never-ending testing on Windows
From
Alexander Lakhin
Date:
Hello Andrew, 14.09.2024 17:22, Andrew Dunstan wrote: > > On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote: >> While trying to reproduce inexplicable drongo failures (e. g., [1]) >> seemingly related to BackgroundPsql, I stumbled upon close, but not >> the same issue. ... >> > > Interesting, good detective work. Still, wouldn't this mean drongo would fail consistently? Yes, I think drongo is suffering from another disease, we're yet to find which. > > I wonder why we're using robocopy instead of our own RecursiveCopy module? > AFAICS, robocopy is also used by regress.exe, so switching to the perl module would require perl even for regress tests. I know that perl was a requirement for MSVC builds, but maybe that's not so with meson... Best regards, Alexander