Thread: Re: Robocopy might be not robust enough for never-ending testing on Windows

Re: Robocopy might be not robust enough for never-ending testing on Windows

From

Andrew Dunstan

Date:

14 September, 20:22:11

On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:
> Hello hackers,
>
> While trying to reproduce inexplicable drongo failures (e.g., [1])
> seemingly related to BackgroundPsql, I stumbled upon a similar, but not
> identical, issue. After many (6-8 thousand) iterations of the
> 015_stream.pl TAP test, psql failed to start with a STATUS_DLL_INIT_FAILED
> error, and a corresponding Windows popup prevented the test from exiting
> (see ss-1 in the archive attached).
>
> Once the system reaches that state, subsequent test runs fail with one
> or another memory-related error (not only in psql, but also in the
> server processes):
> testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11 20:01:51.777 PDT [8812] LOG:  server process (PID 11532) was terminated by exception 0xC00000FD
> testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11 20:01:19.806 PDT [2036] LOG:  server process (PID 10548) was terminated by exception 0xC00000FD
> testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.513 PDT [9128] LOG:  server process (PID 14476) was terminated by exception 0xC0000409
> testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11 20:03:27.801 PDT [10156] LOG:  server process (PID 2236) was terminated by exception 0xC0000409
> testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.359 PDT [10656] LOG:  server process (PID 14712) was terminated by exception 0xC000012D
> testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11 20:02:23.815 PDT [13704] LOG:  server process (PID 13992) was terminated by exception 0xC00000FD
> testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.360 PDT [9760] LOG:  server process (PID 11608) was terminated by exception 0xC0000142
> ...
>
> When tests fail, I see Commit Charge reaching 100% (see ss-2 in the
> attachment), while Physical Memory isn't all used up. To get the OS back
> to a working state, I had to reboot it; killing processes and
> logoff/logon didn't help.
>
> I reproduced this issue again, investigated it and found out that it is
> caused by robocopy (used by PostgreSQL::Test::Cluster->init), which is
> leaking kernel objects, namely "Event objects" within Non-Paged pool on
> each run.
>
> This can be seen with Kernel Pool Monitor, or just with this simple PS script:
> for ($i = 1; $i -le 100; $i++)
> {
> echo "iteration $i"
> rm -r c:\temp\target
> robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
> Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
> }
>
> For me, it shows:
> iteration 1
>                Total    Copied   Skipped  Mismatch    FAILED    Extras
>     Dirs :        27        27         0         0         0         0
>    Files :       968       968         0         0         0         0
>    Bytes :   38.29 m   38.29 m         0         0         0         0
>    Times :   0:00:00   0:00:00                       0:00:00   0:00:00
> ...
> 1226063872
> ...
> iteration 100
>                Total    Copied   Skipped  Mismatch    FAILED    Extras
>     Dirs :        27        27         0         0         0         0
>    Files :       968       968         0         0         0         0
>    Bytes :   38.29 m   38.29 m         0         0         0         0
>    Times :   0:00:00   0:00:00                       0:00:00   0:00:00
> ...
> 1245220864
>
> (That is, each robocopy run leaks 0.1-0.2 MB.)
>
> I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
> installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
> robocopy v14393 on Windows 10 doesn't affect the issue.
>
> Perhaps this information can be helpful for someone who is running
> buildfarm/CI tests on Windows animals...
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53
>
>

Interesting, good detective work. Still, wouldn't this mean drongo would 
fail consistently?

I wonder why we're using robocopy instead of our own RecursiveCopy module?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Robocopy might be not robust enough for never-ending testing on Windows

From

Alexander Lakhin

Date:

14 September, 21:00:00

Hello Andrew,

14.09.2024 17:22, Andrew Dunstan wrote:
>
> On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:
>> While trying to reproduce inexplicable drongo failures (e.g., [1])
>> seemingly related to BackgroundPsql, I stumbled upon a similar, but not
>> identical, issue. ...
>>
>
> Interesting, good detective work. Still, wouldn't this mean drongo would fail consistently?

Yes, I think drongo is suffering from another disease; we have yet to
find out which.

>
> I wonder why we're using robocopy instead of our own RecursiveCopy module?
>

AFAICS, robocopy is also used by regress.exe, so switching to the Perl
module would require Perl even for regress tests. I know that Perl was
a requirement for MSVC builds, but maybe that's not so with meson...

Best regards,
Alexander
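
P.S. As a rough cross-check of the "0.1-0.2 MB per run" estimate, the two
PoolNonpagedBytes samples reported for iterations 1 and 100 can be compared
directly. This is only a minimal sketch: it assumes the growth is linear and
caused entirely by robocopy, and the function and variable names are
illustrative, not taken from any existing tool.

```python
# PoolNonpagedBytes samples reported in the thread (after iterations 1 and 100).
first_sample = 1_226_063_872
last_sample = 1_245_220_864
runs_between = 100 - 1  # 99 robocopy runs separate the two samples


def leak_per_run(first: int, last: int, runs: int) -> float:
    """Average non-paged pool growth in bytes per run, assuming linear growth."""
    return (last - first) / runs


per_run = leak_per_run(first_sample, last_sample, runs_between)
print(f"{per_run:.0f} bytes/run (~{per_run / 2**20:.2f} MiB)")
```

The result falls just under 0.2 MB per run, at the upper end of the range
given above.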