Thread: Re: Robocopy might be not robust enough for never-ending testing on Windows
On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > (That is, 0.1-0.2 MB leaks per one robocopy run.) > > I observed this on Windows 10 (Version 10.0.19045.4780), with all updates > installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using > robocopy v14393 on Windows 10 doesn't affect the issue. I don't understand Windows but that seems pretty weird to me, as it seems to imply that a driver or something fairly low level inside the kernel is leaking objects (at least by simple minded analogies to operating systems I understand better). Either that or robocop.exe has userspace stuff involving at least one thread still running somewhere after it's exited, but that seems unlikely as I guess you'd have noticed that... Just a thought: I was surveying the block cloning landscape across OSes and filesystems while looking into clone-based CREATE DATABASE (CF #4886) and also while thinking about the new TAP test initdb template copy trick, is that robocopy.exe tries to use Windows' block cloning magic, just like cp on recent Linux and FreeBSD systems (at one point I was wondering if that was causing some funky extra flush stalls on some systems, I need to come back to that...). It probably doesn't actually work unless you have Windows 11 kernel with DevDrive enabled (from reading, no Windows here), but I guess it still probably uses the new system interfaces, probably something like CopyFileEx(). Does it still leak if you use /nooffload or /noclone?
On Sun, Sep 15, 2024 at 8:32 AM Thomas Munro <thomas.munro@gmail.com> wrote: > template copy trick, is that robocopy.exe tries to use Windows' block Erm, sorry for my fumbled early morning message editing and garbled English... I meant to write "one thing I noticed is that..".
Re: Robocopy might be not robust enough for never-ending testing on Windows
From
Alexander Lakhin
Date:
Hello Thomas, 14.09.2024 23:32, Thomas Munro wrote: > On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: >> (That is, 0.1-0.2 MB leaks per one robocopy run.) >> >> I observed this on Windows 10 (Version 10.0.19045.4780), with all updates >> installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using >> robocopy v14393 on Windows 10 doesn't affect the issue. > I don't understand Windows but that seems pretty weird to me, as it > seems to imply that a driver or something fairly low level inside the > kernel is leaking objects (at least by simple minded analogies to > operating systems I understand better). Either that or robocop.exe > has userspace stuff involving at least one thread still running > somewhere after it's exited, but that seems unlikely as I guess you'd > have noticed that... Yes, I see no robocopy process left after the test, and I think userspace threads would not survive logoff. > Just a thought: I was surveying the block cloning landscape across > OSes and filesystems while looking into clone-based CREATE DATABASE > (CF #4886) and also while thinking about the new TAP test initdb > template copy trick, is that robocopy.exe tries to use Windows' block > cloning magic, just like cp on recent Linux and FreeBSD systems (at > one point I was wondering if that was causing some funky extra flush > stalls on some systems, I need to come back to that...). It probably > doesn't actually work unless you have Windows 11 kernel with DevDrive > enabled (from reading, no Windows here), but I guess it still probably > uses the new system interfaces, probably something like CopyFileEx(). > Does it still leak if you use /nooffload or /noclone? I tested the following (with the script above): Windows 10 (Version 10.0.19045.4780): robocopy.exe (10.0.19041.4717) /NOOFFLOAD iteration 1 496611328 ... iteration 1000 609701888 That is, it leaks /NOCLONE is not supported by that robocopy version: ERROR : Invalid Parameter #1 : "/NOCLONE" Then, Windows 11 (Version 10.0.22000.613), robocopy 10.0.22000.469: iteration 1 141217792 ... iteration 996 151670784 ... iteration 997 152817664 ... iteration 1000 151674880 That is, it doesn't leak. robocopy.exe /NOOFFLOAD iteration 1 152666112 ... iteration 1000 153341952 No leak. /NOCLONE is not supported by that robocopy version: Then I updated that Windows 11 to Version 10.0.22000.2538 (with KB5031358), robocopy 10.0.22000.1516: iteration 1 122753024 ... iteration 1000 244674560 It does leak. robocopy /NOOFFLOAD iteration 1 167522304 ... iteration 1000 283484160 It leaks as well. Finally, I've installed newest Windows 11 Version 10.0.22631.4169, with robocopy 10.0.22621.3672: Non-paged pool increased from 133 to 380 MB after 1000 robocopy runs. robocopy /OFFLOAD leaks too. /NOCLONE is not supported by that robocopy version: So this leak looks like a recent and still existing defect. (Sorry for a delay, fighting with OS updates/installation took me a while.) Best regards, Alexander
On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > So this leak looks like a recent and still existing defect. From my cartoon-like understanding of Windows, I would guess that if event handles created by a program are leaked after it has exited, it would normally imply that they've been duplicated somewhere else that is still running (for example see the way that PostgreSQL's dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the postmaster, so that the memory segment continues to exist after the backend exits), and if it's that, you'd be able to see the handle count going up in the process monitor for some longer running process somewhere (as seen in this report from the Chrome hackers[1]). And if it's not that, then I would guess it would have to be a kernel bug because something outside userspace must be holding onto/leaking handles. But I don't really understand Windows beyond trying to debug PostgreSQL at a distance, so my guesses may be way off. If we wanted to try to find a Windows expert to look at a standalone repro, does your PS script work with *any* source directory, or is there something about the initdb template, in which case could you post it in a .zip file so that a non-PostgreSQL person could see the failure mode? [1] https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/
Re: Robocopy might be not robust enough for never-ending testing on Windows
From
Alexander Lakhin
Date:
Hello Thomas, 17.09.2024 04:01, Thomas Munro wrote: > On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> So this leak looks like a recent and still existing defect. > From my cartoon-like understanding of Windows, I would guess that if > event handles created by a program are leaked after it has exited, it > would normally imply that they've been duplicated somewhere else that > is still running (for example see the way that PostgreSQL's > dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the > postmaster, so that the memory segment continues to exist after the > backend exits), and if it's that, you'd be able to see the handle > count going up in the process monitor for some longer running process > somewhere (as seen in this report from the Chrome hackers[1]). And if > it's not that, then I would guess it would have to be a kernel bug > because something outside userspace must be holding onto/leaking > handles. But I don't really understand Windows beyond trying to debug > PostgreSQL at a distance, so my guesses may be way off. If we wanted > to try to find a Windows expert to look at a standalone repro, does > your PS script work with *any* source directory, or is there something > about the initdb template, in which case could you post it in a .zip > file so that a non-PostgreSQL person could see the failure mode? > > [1] https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/ That's very interesting reading. I'll try to research the issue that deep later (though I guess this case is different — after logging off and logging in as another user, I can't see any processes belonging to the first one, while those "Event objects" in non-paged pool still occupy memory), but finding a Windows expert who perhaps can look at the robocopy's sources, would be good too (and more productive). So, the repro we can show is: rm -r c:\temp\source mkdir c:\temp\source for ($i = 1; $i -le 1000; $i++) { echo 1 > "c:\temp\source\$i" } for ($i = 1; $i -le 1000; $i++) { echo "iteration $i" rm -r c:\temp\target robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\source c:\temp\target Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes } It produces for me (on Windows 10 [Version 10.0.19045.4780]): iteration 1 ... 216887296 ... iteration 1000 ------------------------------------------------------------------------------ Total Copied Skipped Mismatch FAILED Extras Dirs : 1 1 0 0 0 0 Files : 1000 1000 0 0 0 0 Bytes : 7.8 k 7.8 k 0 0 0 0 Times : 0:00:00 0:00:00 0:00:00 0:00:00 Speed : 17660 Bytes/sec. Speed : 1.010 MegaBytes/min. Ended : Monday, September 16, 2024 8:58:09 PM 365080576 Just "touch c:\temp\source\$i" is not enough, files must be non-empty for the leak to happen. Best regards, Alexander
On Tue, Sep 17, 2024 at 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > but finding a Windows expert who perhaps can look at the > robocopy's sources, would be good too (and more productive). number_of_kernel_bugs_found_by_pgsql_hackers++; I am reliably informed that a kernel bug fix is in the pipeline and should reach Windows 10 in ~3 months.