Thread: Re: Robocopy might be not robust enough for never-ending testing on Windows

On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> (That is, 0.1-0.2 MB leaks per one robocopy run.)
>
> I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
> installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
> robocopy v14393 on Windows 10 doesn't affect the issue.

I don't understand Windows but that seems pretty weird to me, as it
seems to imply that a driver or something fairly low level inside the
kernel is leaking objects (at least by simple minded analogies to
operating systems I understand better).  Either that or robocop.exe
has userspace stuff involving at least one thread still running
somewhere after it's exited, but that seems unlikely as I guess you'd
have noticed that...

Just a thought: I was surveying the block cloning landscape across
OSes and filesystems while looking into clone-based CREATE DATABASE
(CF #4886) and also while thinking about the new TAP test initdb
template copy trick, is that robocopy.exe tries to use Windows' block
cloning magic, just like cp on recent Linux and FreeBSD systems (at
one point I was wondering if that was causing some funky extra flush
stalls on some systems, I need to come back to that...).  It probably
doesn't actually work unless you have Windows 11 kernel with DevDrive
enabled (from reading, no Windows here), but I guess it still probably
uses the new system interfaces, probably something like CopyFileEx().
Does it still leak if you use /nooffload or /noclone?



On Sun, Sep 15, 2024 at 8:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> template copy trick, is that robocopy.exe tries to use Windows' block

Erm, sorry for my fumbled early morning message editing and garbled
English... I meant to write "one thing I noticed is that..".



Re: Robocopy might be not robust enough for never-ending testing on Windows

From
Alexander Lakhin
Date:
Hello Thomas,

14.09.2024 23:32, Thomas Munro wrote:
> On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> (That is, 0.1-0.2 MB leaks per one robocopy run.)
>>
>> I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
>> installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
>> robocopy v14393 on Windows 10 doesn't affect the issue.
> I don't understand Windows but that seems pretty weird to me, as it
> seems to imply that a driver or something fairly low level inside the
> kernel is leaking objects (at least by simple minded analogies to
> operating systems I understand better).  Either that or robocop.exe
> has userspace stuff involving at least one thread still running
> somewhere after it's exited, but that seems unlikely as I guess you'd
> have noticed that...

Yes, I see no robocopy process left after the test, and I think userspace
threads would not survive logoff.

> Just a thought: I was surveying the block cloning landscape across
> OSes and filesystems while looking into clone-based CREATE DATABASE
> (CF #4886) and also while thinking about the new TAP test initdb
> template copy trick, is that robocopy.exe tries to use Windows' block
> cloning magic, just like cp on recent Linux and FreeBSD systems (at
> one point I was wondering if that was causing some funky extra flush
> stalls on some systems, I need to come back to that...).  It probably
> doesn't actually work unless you have Windows 11 kernel with DevDrive
> enabled (from reading, no Windows here), but I guess it still probably
> uses the new system interfaces, probably something like CopyFileEx().
> Does it still leak if you use /nooffload or /noclone?

I tested the following (with the script above):
Windows 10 (Version 10.0.19045.4780):
robocopy.exe (10.0.19041.4717) /NOOFFLOAD
iteration 1
496611328
...
iteration 1000
609701888

That is, it leaks

/NOCLONE is not supported by that robocopy version:
ERROR : Invalid Parameter #1 : "/NOCLONE"

Then, Windows 11 (Version 10.0.22000.613), robocopy 10.0.22000.469:
iteration 1
141217792
...
iteration 996
151670784
...
iteration 997
152817664
...
iteration 1000
151674880

That is, it doesn't leak.

robocopy.exe /NOOFFLOAD
iteration 1
152666112
...
iteration 1000
153341952

No leak.

/NOCLONE is not supported by that robocopy version:

Then I updated that Windows 11 to Version 10.0.22000.2538 (with KB5031358),
robocopy 10.0.22000.1516:
iteration 1
122753024
...
iteration 1000
244674560

It does leak.

robocopy /NOOFFLOAD
iteration 1
167522304
...
iteration 1000
283484160

It leaks as well.

Finally, I've installed newest Windows 11 Version 10.0.22631.4169, with
robocopy 10.0.22621.3672:
Non-paged pool increased from 133 to 380 MB after 1000 robocopy runs.

robocopy /OFFLOAD leaks too.

/NOCLONE is not supported by that robocopy version:

So this leak looks like a recent and still existing defect.

(Sorry for a delay, fighting with OS updates/installation took me a while.)

Best regards,
Alexander



On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> So this leak looks like a recent and still existing defect.

From my cartoon-like understanding of Windows, I would guess that if
event handles created by a program are leaked after it has exited, it
would normally imply that they've been duplicated somewhere else that
is still running (for example see the way that PostgreSQL's
dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
postmaster, so that the memory segment continues to exist after the
backend exits), and if it's that, you'd be able to see the handle
count going up in the process monitor for some longer running process
somewhere (as seen in this report from the Chrome hackers[1]).  And if
it's not that, then I would guess it would have to be a kernel bug
because something outside userspace must be holding onto/leaking
handles.  But I don't really understand Windows beyond trying to debug
PostgreSQL at a distance, so my guesses may be way off.  If we wanted
to try to find a Windows expert to look at a standalone repro, does
your PS script work with *any* source directory, or is there something
about the initdb template, in which case could you post it in a .zip
file so that a non-PostgreSQL person could see the failure mode?

[1] https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/



Re: Robocopy might be not robust enough for never-ending testing on Windows

From
Alexander Lakhin
Date:
Hello Thomas,

17.09.2024 04:01, Thomas Munro wrote:
> On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> So this leak looks like a recent and still existing defect.
>  From my cartoon-like understanding of Windows, I would guess that if
> event handles created by a program are leaked after it has exited, it
> would normally imply that they've been duplicated somewhere else that
> is still running (for example see the way that PostgreSQL's
> dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
> postmaster, so that the memory segment continues to exist after the
> backend exits), and if it's that, you'd be able to see the handle
> count going up in the process monitor for some longer running process
> somewhere (as seen in this report from the Chrome hackers[1]).  And if
> it's not that, then I would guess it would have to be a kernel bug
> because something outside userspace must be holding onto/leaking
> handles.  But I don't really understand Windows beyond trying to debug
> PostgreSQL at a distance, so my guesses may be way off.  If we wanted
> to try to find a Windows expert to look at a standalone repro, does
> your PS script work with *any* source directory, or is there something
> about the initdb template, in which case could you post it in a .zip
> file so that a non-PostgreSQL person could see the failure mode?
>
> [1] https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/

That's very interesting reading. I'll try to research the issue that deep
later (though I guess this case is different — after logging off and
logging in as another user, I can't see any processes belonging to the
first one, while those "Event objects" in non-paged pool still occupy
memory), but finding a Windows expert who perhaps can look at the
robocopy's sources, would be good too (and more productive).

So, the repro we can show is:
rm -r c:\temp\source
mkdir c:\temp\source
for ($i = 1; $i -le 1000; $i++)
{
echo 1 > "c:\temp\source\$i"
}

for ($i = 1; $i -le 1000; $i++)
{
echo "iteration $i"
rm -r c:\temp\target
robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\source c:\temp\target
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}

It produces for me (on Windows 10 [Version 10.0.19045.4780]):
iteration 1
...
216887296
...
iteration 1000


------------------------------------------------------------------------------

                Total    Copied   Skipped  Mismatch    FAILED Extras
     Dirs :         1         1         0         0         0 0
    Files :      1000      1000         0         0         0 0
    Bytes :     7.8 k     7.8 k         0         0         0 0
    Times :   0:00:00   0:00:00                       0:00:00 0:00:00


    Speed :               17660 Bytes/sec.
    Speed :               1.010 MegaBytes/min.
    Ended : Monday, September 16, 2024 8:58:09 PM

365080576

Just "touch c:\temp\source\$i" is not enough, files must be non-empty for
the leak to happen.

Best regards,
Alexander



On Tue, Sep 17, 2024 at 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> but finding a Windows expert who perhaps can look at the
> robocopy's sources, would be good too (and more productive).

number_of_kernel_bugs_found_by_pgsql_hackers++;

I am reliably informed that a kernel bug fix is in the pipeline and
should reach Windows 10 in ~3 months.